storage: test GC queue in presence of many intents #15997
@tschottdorf should this move to 1.2? or later?
1.2 will do for now.
I wrote a little tool that fires up a local single node and writes expired txn entries into its liveness range. I haven't tested how the GC queue reacts yet, but the
This means that the GC queue will try to push 13k transactions the next time it runs. Each of the transactions holds 100 intents, so after that it will try to resolve 13k*100 = 1.3 million intents. If a cluster can remain available with this going on on its liveness range, I think we're fairly safe. I'm sure it won't fare too well when I test this tomorrow.
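To make the arithmetic concrete, here is a minimal Go sketch of the work this creates for the GC queue (the types are illustrative stand-ins, not the actual CockroachDB types): one transaction push per expired record, then one resolution per intent, i.e. 13k pushes followed by 1.3M intent resolutions.

```go
package main

import "fmt"

// Simplified stand-ins for illustration; not the real CockroachDB types.
type Txn struct{ ID int }
type Intent struct{ TxnID int }

// gcAbandonedTxns sketches the shape of the work described above: for every
// expired transaction record, the GC queue first pushes (aborts) the
// transaction, then resolves each of its intents.
func gcAbandonedTxns(txns []Txn, intentsOf func(Txn) []Intent) (pushed, resolved int) {
	for _, txn := range txns {
		pushed++ // one PushTxn per expired transaction record
		resolved += len(intentsOf(txn))
	}
	return pushed, resolved
}

func main() {
	txns := make([]Txn, 13000)
	intentsOf := func(Txn) []Intent { return make([]Intent, 100) }
	p, r := gcAbandonedTxns(txns, intentsOf)
	fmt.Printf("pushed %d txns, resolved %d intents\n", p, r) // 13000, 1300000
}
```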
cc @nvanbenschoten wrt #18199. |
Awesome @tschottdorf! Being able to easily create this pathological condition in a cluster is a big first step towards addressing it. I'd be curious to hear your ideas towards improving our ability to perform large-scale garbage collection once we're in this state.
Manual testing in cockroachdb#15997 surfaced that one limiting factor in resolving many intents is contention on the transaction's abort cache entry. In one extreme test, I wrote 10E6 abortable intents into a single range, in which case the GC queue sends very large batches of intent resolution requests for the same transaction to the intent resolver. These requests all overlapped on the transaction's abort cache key, causing very slow progress and ultimately preventing the GC queue from making a dent in the minute allotted to it. Generally this appears to be a somewhat atypical case, but since @nvanbenschoten observed something similar in cockroachdb#18199, it seemed well worth addressing, by means of:

1. allowing intent resolutions to not touch the abort span, and
2. correctly declaring the keys for `ResolveIntent{,Range}` so that the abort cache key is only declared if it is actually going to be accessed.

With these changes, the GC queue was able to clear out a million intents comfortably on my older 13" MacBook (single node).

We also use this option in the intent resolver where possible: most transactions don't receive abort cache entries, and intents are often "found" by multiple conflicting writers. We want to avoid adding artificial contention there, though in many situations the same intent is resolved and so a conflict still exists.

Migration: a new field number was added to the proto and the old one is preserved; we continue to populate it. Downstream of Raft, we use the new field, but if it's unset, we synthesize it from the deprecated field. I believe this is sufficient and we can remove all traces of the old field in v1.3 (v1.1 uses only the old field, v1.2 uses the new one with compatibility for the old, v1.3 only the new field).
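A minimal sketch of point 2 above, with hypothetical stand-in types (`Span`, `SpanSet`, `ResolveIntentArgs`, and the `Poison` condition are simplifications, not the actual CockroachDB declarations): once the shared abort-span key is declared only when it will actually be written, resolutions for different intents of the same transaction no longer overlap and can proceed concurrently.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real key/span-declaration machinery.
type Span struct{ Key string }
type SpanSet struct{ RW []Span }

func (s *SpanSet) AddRW(sp Span) { s.RW = append(s.RW, sp) }

type ResolveIntentArgs struct {
	IntentKey string
	TxnID     string
	Poison    bool // simplified condition for "will write an abort-span entry"
}

// declareKeys sketches the fix: every resolution still declares its own
// intent key, but the transaction's shared abort-span key is only declared
// when it is actually going to be accessed. Before the fix, all resolutions
// for the same txn overlapped on that one key and serialized.
func declareKeys(args ResolveIntentArgs, spans *SpanSet) {
	spans.AddRW(Span{Key: args.IntentKey})
	if args.Poison {
		spans.AddRW(Span{Key: "abort-span/" + args.TxnID})
	}
}

func main() {
	var a, b SpanSet
	declareKeys(ResolveIntentArgs{IntentKey: "k1", TxnID: "txn1"}, &a)
	declareKeys(ResolveIntentArgs{IntentKey: "k2", TxnID: "txn1"}, &b)
	fmt.Println(a.RW, b.RW) // disjoint spans: the two resolutions can run concurrently
}
```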
Fallout from cockroachdb#18199 and corresponding testing in cockroachdb#15997. When the context is expired, there is no point in shooting off another gazillion requests.
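A minimal Go sketch of the pattern this describes (the function and batching are illustrative, not the actual intent-resolver code): check the context between batches and bail out instead of firing off further requests that are doomed to miss the deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// resolveInBatches sends intents in fixed-size batches, checking the
// context before each batch and stopping early once it has expired.
func resolveInBatches(ctx context.Context, intents []string, batchSize int, send func([]string)) error {
	for len(intents) > 0 {
		if err := ctx.Err(); err != nil {
			return err // context expired: stop issuing new batches
		}
		n := batchSize
		if n > len(intents) {
			n = len(intents)
		}
		send(intents[:n])
		intents = intents[n:]
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	intents := make([]string, 1000)
	err := resolveInBatches(ctx, intents, 100, func(batch []string) {
		time.Sleep(time.Millisecond) // stand-in for sending a resolution batch
	})
	fmt.Println(err) // typically "context deadline exceeded" after a batch or two
}
```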
Fallout from cockroachdb#18199 and corresponding testing in cockroachdb#15997. I think it'll be nontrivial to max out these budgets in practice, but I can definitely do it in intentionally evil tests, and it's good to know that there is some rudimentary form of memory accounting in this queue.
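A rough sketch of what such rudimentary memory accounting can look like (names and the budget value are hypothetical; the real queue's accounting differs): track the bytes of keys gathered for a batch and flush once the budget is hit, so an arbitrarily large backlog never accumulates in memory all at once.

```go
package main

import "fmt"

// keyBudget accumulates keys for a batch until a byte budget is reached,
// then hands the full batch back to the caller and starts a new one.
type keyBudget struct {
	limit, used int
	batch       [][]byte
}

// add appends key to the current batch; if the budget would be exceeded,
// it first returns the finished batch for processing.
func (b *keyBudget) add(key []byte) (flushed [][]byte) {
	if b.used+len(key) > b.limit && len(b.batch) > 0 {
		flushed, b.batch, b.used = b.batch, nil, 0
	}
	b.batch = append(b.batch, key)
	b.used += len(key)
	return flushed
}

func main() {
	b := keyBudget{limit: 1 << 10} // 1 KiB per batch, arbitrary for the sketch
	var flushes int
	for i := 0; i < 1000; i++ {
		if f := b.add([]byte(fmt.Sprintf("key-%04d", i))); f != nil {
			flushes++ // a real queue would send a GC batch here
		}
	}
	fmt.Println("flushes:", flushes)
}
```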
@danhhz you mentioned the other day that you were interested in getting "store directory fixtures" in addition to the backup-based fixtures. This would be a candidate for that. Is there an issue I should link here? Moving to 2.1 since that's where it will realistically come together. For now, I'm reasonably sure that the badness I was able to create with the experiments above is no more.
We're still figuring out what the tooling around that will look like, but the 2TB backup test already uses them (see cockroach/pkg/cmd/roachtest/backup.go, lines 32 to 46 at 6f1ba76).
Great! Optimistically moving this back to 2.0, then.
We have done little testing of this case.