stability: run insert&delete-heavy workload with intent/GC queue pressure #9540

Closed
tbg opened this issue Sep 26, 2016 · 2 comments
tbg commented Sep 26, 2016

We haven't been running workloads which create significant amounts of GCable data (i.e. old versions; txn records; abort span records), or which write large ranges/numbers of intents (e.g. via DeleteRange).

There are likely many dark corners here, some of which are already known.

The first two seem fairly straightforward at this point, but the last one is pretty involved. In any case, we ought to be running workloads that clearly expose those issues.
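
For concreteness, here is what such a workload could look like when driven through the SQL layer. This is only an illustration and assumes things the issue doesn't specify: a locally running single-node cluster on the default port 26257, a database named `test`, and a made-up table `kv` with an arbitrary batch size. Each round rewrites the same block of keys and then deletes it with a single ranged statement, so non-live versions and intents keep piling up:

```go
// Illustrative workload sketch (not from this issue): churn the same block of
// keys so each round leaves superseded versions behind for the GC queue and a
// burst of intents from the ranged delete. Connection string, table name, and
// batch size are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (k INT PRIMARY KEY, v STRING)`); err != nil {
		log.Fatal(err)
	}

	const batch = 1000
	for round := 0; ; round++ { // run until interrupted
		// Overwrite the same keys every round; the superseded versions become
		// non-live and, once past the TTL, GCable.
		for k := 0; k < batch; k++ {
			if _, err := db.Exec(`UPSERT INTO kv VALUES ($1, $2)`, k, fmt.Sprintf("round-%d", round)); err != nil {
				log.Fatal(err)
			}
		}
		// Delete the whole block in one statement, roughly approximating the
		// ranged-delete/intent pressure the issue mentions.
		if _, err := db.Exec(`DELETE FROM kv WHERE k >= 0 AND k < $1`, batch); err != nil {
			log.Fatal(err)
		}
	}
}
```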

@tbg added the S-1-stability label on Sep 26, 2016
@petermattis added this to the 1.0 milestone on Feb 22, 2017
@spencerkimball modified the milestones: Later, 1.0 on Apr 3, 2017
@spencerkimball added the C-investigation and C-performance labels and removed the S-1-stability label on Apr 3, 2017
tbg commented May 11, 2017

Assigning myself for triage.

@tbg self-assigned this on May 11, 2017
@tbg modified the milestones: Later, 1.2 on Sep 21, 2017
tbg commented Nov 16, 2017

Closing for #15997, which has more activity.

@tbg closed this as completed on Nov 16, 2017
@tbg added a commit to tbg/cockroach that referenced this issue on Dec 16, 2017
This PR shows the tooling I used to [stress test the GC queue]. In short, I needed a way to put
large amounts of intents on a single range; I didn't particularly care to do this on a multi-node
cluster, but I needed to do it efficiently for quick turnaround (and also to prevent the GC queue
from cleaning up my garbage faster than I could insert it).
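
The tool itself works below SQL. Purely as a sketch of the core idea, and assuming the circa-2017 package layout, one could assemble a transactional `BatchRequest` whose puts all land under a single key prefix (and hence on a single range); the function name and prefix scheme are made up, and the caller is assumed to supply and keep alive a `*roachpb.Transaction` so that the resulting intents aren't resolved right away:

```go
// Sketch only (not the PR's actual code): build a transactional batch of puts
// under one key prefix so that all resulting intents land on a single range.
// Package paths reflect the repository layout around 2017.
package gcpressure

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/encoding"
)

// makeIntentBatch builds a BatchRequest that, when executed on behalf of txn,
// lays down n intents under prefix.
func makeIntentBatch(txn *roachpb.Transaction, prefix roachpb.Key, n int) roachpb.BatchRequest {
	var ba roachpb.BatchRequest
	ba.Txn = txn
	for i := 0; i < n; i++ {
		key := encoding.EncodeUvarintAscending(append([]byte(nil), prefix...), uint64(i))
		ba.Add(roachpb.NewPut(roachpb.Key(key), roachpb.MakeValueFromString("garbage")))
	}
	return ba
}
```

How such a batch is then evaluated or executed is exactly what the debug endpoint discussed below provides.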

This was also a good opportunity to investigate "better" debugging tools and to revisit the
`ExternalServer` interface, which is historically the KV API we once wanted to expose to clients. It
has since become internal and is technically slated for removal, but it has seen continued use all
the same. The reasons for keeping it (or something like it) around are:

1. Debugging running clusters that are potentially wedged due to invalid KV data: being able to
   read transaction entries and raw KV data that the SQL layer cannot interpret.
2. Creating, in our testing, problematic conditions that are unattainable through the public
   interfaces (artificial GC pressure being one example).

I also think there is a case for adding functionality such as being able to force a Range to run
garbage collection, though that's out of scope here.

In this PR, I've sketched out a TxnCoordSender-level entry point that is tied to a bidirectional
streaming connection. This has the advantage that there is a context whose lifetime is tied to the
connection, which means that `TxnCoordSender` can base its transaction heartbeats on it (this is not
to suggest that we should be running serious transactions through this interface, but it establishes
parity, and, assuming that `client.NewSender` went through this endpoint instead, `TxnCoordSender`
could be simplified to always use the incoming context). There is more subtlety here since we want
to [merge] `TxnCoordSender` and `client.{DB,Txn}`, so don't take this as a concrete suggestion.
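
To make the shape of that entry point concrete, here is a minimal sketch. The `txnStream` interface stands in for a proto-generated bidirectional stream and, like the function name, doesn't exist in the repo; `client.Sender` and the `roachpb` types are the real (circa-2017) ones:

```go
// Hypothetical sketch of the streaming entry point; txnStream stands in for a
// proto-generated bidirectional stream and is not a real CockroachDB type.
package kvdebug

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/internal/client"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

type txnStream interface {
	Context() context.Context
	Recv() (*roachpb.BatchRequest, error)
	Send(*roachpb.BatchResponse) error
}

// serveTxnStream routes each incoming batch through the sender using the
// stream's context. When the client disconnects, that context is canceled, so
// anything derived from it (such as TxnCoordSender's heartbeat loop) stops
// with the connection.
func serveTxnStream(sender client.Sender, stream txnStream) error {
	ctx := stream.Context()
	for {
		ba, err := stream.Recv()
		if err != nil {
			return err // io.EOF once the client closes its side
		}
		br, pErr := sender.Send(ctx, *ba)
		if pErr != nil {
			return pErr.GoError()
		}
		if err := stream.Send(br); err != nil {
			return err
		}
	}
}
```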

What's been more immediately useful is a pretty low-level endpoint that allows evaluating a
`BatchRequest` on any given `Replica` (bypassing the command queue, etc.) and seeing the results.
More controversially, and importantly for `gcpressurizer`, it can also *execute* these batches,
which is quite dangerous in the wrong hands due to the potential for creating inconsistency and its
insufficient synchronization with splits, etc. I think that's the part worth exploring, since it's a
universally useful last resort when things go wrong and visibility into on-disk state is desired
without shutting down the node.
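
For illustration, the request/response shape of such an endpoint might look roughly like this. All names here are hypothetical (only the `roachpb` types are real), and the handler is left as a stub because the interesting part, evaluating against the `Replica`, lives behind internal store APIs:

```go
// Hypothetical shape of the low-level debug endpoint; none of these names
// exist in the repository.
package kvdebug

import (
	"context"
	"errors"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// EvalBatchRequest names a specific replica and carries a raw batch. With
// Execute=false the batch is only evaluated and its would-be writes reported;
// with Execute=true the writes are applied (dangerous: no command queue, no
// synchronization with splits).
type EvalBatchRequest struct {
	StoreID roachpb.StoreID
	RangeID roachpb.RangeID
	Batch   roachpb.BatchRequest
	Execute bool
}

type EvalBatchResponse struct {
	Response roachpb.BatchResponse
	// For evaluate-only requests, a human-readable dump of the writes the
	// batch would have performed.
	WouldWrite []string
}

type debugServer struct{}

// EvalBatch is where the store would look up the replica, evaluate the batch
// against its engine, and (only if req.Execute) apply the resulting writes.
func (s *debugServer) EvalBatch(ctx context.Context, req *EvalBatchRequest) (*EvalBatchResponse, error) {
	return nil, errors.New("sketch only: evaluation against the Replica is not implemented here")
}
```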

Long story short, I have this code and it's definitely not something to check in, but to discuss.
It'd be nice to programmatically test the GC queue in that way, and perhaps randomly "pollute"
some of our test clusters in ever-escalating ways, to improve their resilience.

[stress test the GC queue]: cockroachdb#9540
[merge]: cockroachdb#16000