stability: run insert&delete-heavy workload with intent/GC queue pressure #9540
Labels: C-investigation (Further steps needed to qualify. C-label will change.), C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)
tbg added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting) on Sep 26, 2016

spencerkimball added the C-investigation and C-performance labels and removed the S-1-stability label on Apr 3, 2017
Assigning myself for triage.

Closing for #15997, which has more activity.
tbg added a commit to tbg/cockroach that referenced this issue on Dec 16, 2017:
This PR shows the tooling I used to [stress test the GC queue]. In short, I needed a way to put large numbers of intents on a single range; I didn't particularly care to do this on a multi-node cluster, but I needed to do it efficiently for quick turnaround (and also to prevent the GC queue from cleaning up my garbage faster than I could insert it).

This was also a good opportunity to investigate "better" debugging tools and to revisit the `ExternalServer` interface, which historically has been the KV store we once wanted to expose to clients. It has since become internal and is technically slated for removal, but at the same time it has seen continued use. The reasons for keeping (something like it) are:

1. to debug running clusters that are potentially wedged due to invalid KV data, by being able to read transaction entries and raw KV data that the SQL layer does not expect, and
2. in our testing, to create problematic conditions that are unattainable through the public interfaces (creating artificial GC pressure being one example).

I also think there is a point to be made for adding functionality such as being able to force a Range to run garbage collection, though that's out of scope here.

In this PR, I've sketched out a TxnCoordSender-level entry point that is tied to a bidirectional streaming connection. This has the advantage that there is a context available whose lifetime is tied to the connection, which means that `TxnCoordSender` can base its transaction heartbeats on it (this is not to suggest that we should be running serious transactions through this interface, but it establishes parity, and, assuming that `client.NewSender` went through this endpoint instead, `TxnCoordSender` could be simplified to always use the incoming context). There is more subtlety here since we want to [merge] `TxnCoordSender` and `client.{DB,Txn}`, so don't take this as a concrete suggestion.

What's been more immediately useful is a fairly low-level endpoint that allows evaluating a `BatchRequest` on any given `Replica` (bypassing the command queue, etc.) and seeing the results. More controversially, and importantly for `gcpressurizer`, it also allows *executing* these batches, which is quite dangerous in the wrong hands due to the potential for creating inconsistency and its insufficient synchronization with splits, etc. I think that's the part worth exploring, since it's a universally useful last resort when things go wrong and visibility into on-disk state is desired without shutting down the node.

Long story short, I have this code and it's definitely not something to check in, but to discuss. It'd be nice to programmatically test the GC queue in that way, and perhaps to randomly "pollute" some of our test clusters in ever-escalating ways to improve their resilience.

[stress test the GC queue]: cockroachdb#9540
[merge]: cockroachdb#16000
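To make the shape of that streaming evaluation endpoint concrete, here is a minimal sketch, not taken from the PR, of what a bidirectional streaming handler could look like. All names (`DebugKV_EvalStream`, `EvalBatch`, the request/response stubs) are hypothetical stand-ins; the point is only that each batch is evaluated within the lifetime of a context that dies with the connection.

```go
// Package debugsketch illustrates a hypothetical bidirectional streaming
// debug endpoint for evaluating KV batches on a replica. None of these
// names exist in the CockroachDB codebase.
package debugsketch

import (
	"io"

	"google.golang.org/grpc"
)

// BatchRequest and BatchResponse stand in for the real KV batch types.
type BatchRequest struct{}
type BatchResponse struct{}

// DebugKV_EvalStream mimics what a generated gRPC bidirectional stream
// interface for such an endpoint would look like.
type DebugKV_EvalStream interface {
	Send(*BatchResponse) error
	Recv() (*BatchRequest, error)
	grpc.ServerStream
}

// evaluator abstracts "evaluate this batch directly on a replica",
// bypassing the command queue; the real hook would live on the Replica.
type evaluator interface {
	EvalBatch(req *BatchRequest) (*BatchResponse, error)
}

// EvalStream reads batches off the connection and evaluates them one by
// one. stream.Context() is canceled when the client disconnects, which is
// what makes it a natural parent for any heartbeat loop started on behalf
// of the stream.
func EvalStream(stream DebugKV_EvalStream, ev evaluator) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil // client closed its send side
		}
		if err != nil {
			return err
		}
		resp, err := ev.EvalBatch(req)
		if err != nil {
			return err
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}
```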
We haven't been running workloads which create significant amounts of GCable data (i.e. old versions, txn records, abort span records), or which write large ranges/numbers of intents (i.e. DeleteRange). There are likely many dark corners here, some of which are already known.

The first two seem fairly straightforward at this point, but the last one is pretty involved. In any case, we ought to be running workloads that clearly expose those issues.
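As a rough illustration of the kind of insert&delete-heavy workload described above, here is a minimal sketch written against the SQL layer with `database/sql`. It is not part of the original issue: the table name, batch size, value size, and connection string are made up, and it assumes `generate_series` support in the CockroachDB version under test. Every insert/delete cycle leaves behind old MVCC versions, tombstones, and transaction records for the GC queue to clean up.

```go
// gcpressure is a hypothetical sketch of a workload that generates GCable
// data by repeatedly inserting and deleting batches of rows.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Connection string and table name are illustrative only.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS gc_pressure (k INT PRIMARY KEY, v STRING)`); err != nil {
		log.Fatal(err)
	}

	const batch = 1000
	for i := 0; ; i++ {
		lo, hi := i*batch, (i+1)*batch
		// Insert a contiguous batch of rows; each row is a new MVCC version.
		if _, err := db.Exec(
			`INSERT INTO gc_pressure SELECT g, repeat('x', 256) FROM generate_series($1, $2) AS g`,
			lo, hi-1); err != nil {
			log.Fatal(err)
		}
		// Delete the same rows again; this leaves deletion tombstones and,
		// for large batches, many intents in a single transaction: exactly
		// the kind of garbage the GC queue has to clean up.
		if _, err := db.Exec(
			`DELETE FROM gc_pressure WHERE k >= $1 AND k < $2`, lo, hi); err != nil {
			log.Fatal(err)
		}
		if i%100 == 0 {
			log.Printf("completed %d insert/delete cycles", i+1)
		}
	}
}
```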