storage: splits can hit command is too large error #25233
Comments
@tschottdorf any idea here about what components of a Raft command for an EndTransaction with SplitTrigger can grow to an unbounded size? The only things I see are the start and end keys of the new and old RangeDescriptors and the AbortSpan. Do you think it's realistic that an abort span can grow this large (~100 MB)?
Yes, I think it's possible. (And while it's also possible for the range descriptors to be very large if you try to index an unbounded field, we know from the logs provided here that that's not the case.) Under high-contention workloads (especially those that don't use the savepoint retry protocol), you could end up generating abort span entries at a significant rate. What's worse is that abort span entries fall into an edge case for splitting: as replicated RangeID-local keys, they count toward the total size of the range for purposes of deciding whether it needs to split, but you can't actually split between them. If you end up with enough abort span entries to take up a significant fraction of the range, I think you'd see a lot of split loops as you try to split the regular keys of the range and end up with new ranges that are still oversized because of their abort spans.
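To make that split-loop failure mode concrete, here is a minimal Go sketch (not CockroachDB's actual code; the function names, sizes, and threshold are made up) of why splitting user keys doesn't shrink a range whose size is dominated by its abort span:

```go
package main

import "fmt"

// needsSplit mirrors the size check described above: the decision uses the
// *total* range size, which includes the replicated RangeID-local keys
// (abort span entries among them), even though a split point can only be
// chosen among the user keys.
func needsSplit(userKeyBytes, rangeIDLocalBytes, splitThresholdBytes int64) bool {
	return userKeyBytes+rangeIDLocalBytes > splitThresholdBytes
}

func main() {
	const threshold = 64 << 20 // hypothetical 64 MiB split threshold
	userKeyBytes := int64(10 << 20)
	abortSpanBytes := int64(80 << 20) // abort span dominates the range

	fmt.Println("needs split:", needsSplit(userKeyBytes, abortSpanBytes, threshold))

	// A split divides only the user keys; the abort span stays with (and is
	// copied to) the resulting ranges, so both halves can still be oversized
	// and the split queue keeps retrying.
	fmt.Println("still oversized after split:",
		needsSplit(userKeyBytes/2, abortSpanBytes, threshold))
}
```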
Is this because we'll be abandoning more transactions instead of seeing them through to completion at higher epochs?
So is there a good solution here? Just reduce your GC TTL to a point where the number of abort span entries is kept in check?
Yes (I think; haven't verified this)
Unfortunately the GC TTL won't help here. The TTL for abort span entries is hard-coded to one hour. We need some sort of backpressure when there are too many abort span entries (and maybe also some way to write fewer of them or purge them early).
It's unexpected that any workload would leak abort spans at a significant rate, though it's not something that we've looked at and so I wouldn't be extremely surprised. We could at least add a metric that tells you how many abort spans are written vs cleared and we should test it for various workloads. I agree with Ben's analysis. Once you're in that state, it'll stay in that state until the abort span has been GC'ed. This is also something we should roachtest.
It will help somewhat because it might activate GC earlier (i.e. after an hour as opposed to maybe never) but yeah, it won't help reliably in any sense. My best stab at a fix is this:
@nvanbenschoten I could not reopen #25151, so I'm commenting here. The performance has degraded again. I have observed the same behavior each time: whenever I increase the command size as you suggested, performance improves for a few hours and then starts degrading again, as you can see from the image below.
The issue is very easily reproducible with our schemas and cluster. Whenever I create a new cluster and put load on it, it works fine for some time and then its performance starts degrading. I have emailed you the debug logs again.
@debraj-manna interesting that you can reproduce this easily. We have discussed ways of addressing this problem above in this issue, but we're not sure how you're getting into this situation in the first place. To help us solve this, would you mind sharing the schema and load generation with us (you could email this if you don't want it to be visible publicly)? That will be immensely helpful in addressing this issue.
@tschottdorf - I have emailed you the debug logs. Let me know if you have received them. The load generation is a little difficult to explain as we are generating it via our application. We have a 5 node cluster in AWS of
Thanks @debraj-manna. The logs actually don't contain the schema, so I'd need that separately. If you have a few sample queries, those would help too.
@tschottdorf - I have emailed you the sample queries along with the output you asked for. I have one more question: could this issue be because of read-write conflicts?
I think that both these queries are touching the same rows, and the transactions are getting aborted and retried. If this is the case, is there a workaround? We are okay with not seeing new writes made after the read query started - in other words, we are okay with read queries not counting toward transaction conflicts.
@tschottdorf @nvanbenschoten - So in
Do you think this could contribute to this problem?
Did you observe anything unusual from the logs & schemas?
In particular, this tracks poisoning aborts, which create abort span entries (see cockroachdb#25233). Some refactoring to make it easier to add such counters in the future was carried out. Release note: None
See cockroachdb#25233. The abort span (or transaction span) can become arbitrarily large, and yet its GC is not driven by a metric. This makes it possible for a range to become highly GC'able without having it be picked up by GC. Even worse, if the abort span is too large we won't be able to split the range any more; a split *copies* the abort span into the right hand side (though in theory it could copy only a relevant part and remove these items from the LHS). This means that such a range can cause stability problems. Address this problem in a somewhat inelegant but hopefully workable way: The SysBytes of the range queues the replica for "aggressive" GC when it exceeds a set fraction of the max raft command size. In "aggressive" mode, everything in the local key span older than a minute is cleared out (in regular operation it's one hour). Doing this *should* interrupt long-running transactions (though in theory it could let them refresh their spans) though due to deficiencies in how the TxnGCSpanThreshold is handled in practice it won't (this needs to be fixed separately). Release note (bug fix): Prevent formation of large ranges in some workloads with many aborted transactions.
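As a rough illustration of the heuristic described in that PR, here is a Go sketch of the threshold logic (a sketch only: the constant values, names, and function are assumptions for illustration, not CockroachDB's actual constants or API):

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Assumed values for illustration; the real limits are cluster settings.
	maxRaftCommandSize     = 64 << 20 // bytes
	aggressiveGCFraction   = 0.25     // fraction of the command size limit
	normalAbortSpanTTL     = time.Hour
	aggressiveAbortSpanTTL = time.Minute
)

// abortSpanGCThreshold picks how old local (abort span) entries must be before
// they are GC'ed, switching to the aggressive one-minute threshold when the
// range's system bytes are large enough to endanger a future split command.
func abortSpanGCThreshold(sysBytes int64) time.Duration {
	if float64(sysBytes) > aggressiveGCFraction*float64(maxRaftCommandSize) {
		return aggressiveAbortSpanTTL
	}
	return normalAbortSpanTTL
}

func main() {
	fmt.Println(abortSpanGCThreshold(1 << 20))  // small range-local data: 1h TTL
	fmt.Println(abortSpanGCThreshold(40 << 20)) // large abort span: 1m TTL
}
```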
@debraj-manna Thanks for the email, I'll try to take a look tomorrow. For now, I have a PR that experiments with resolving this problem when it occurs (though it likely won't occur in the first place when we've addressed it). It's not ready for prime time though.
Thanks @tschottdorf. Also let me know your thoughts on the query below if you manage to look at it tomorrow.
This is likely part of the issue, yes.
@debraj-manna one way to avoid aborting write transactions when performing these "tail" queries is to perform consistent historical reads. These reads take place at a specified time in the past, and therefore do not interfere with current-time transactions. For instance, you could perform these tail queries 2 seconds in the past, which should be sufficient to avoid aborting writing txns. This is described further in our docs. @tschottdorf from the query patterns, this sounds like a queue-type workload. I'm curious if we will see issues like this when we create #24236.
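For example, a "tail" query can be run at a slightly stale timestamp with AS OF SYSTEM TIME. The Go sketch below shows the idea; the connection string, table, and columns are placeholders, and depending on the CockroachDB version an absolute timestamp may be required instead of a relative one like '-2s':

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Illustrative insecure local connection; adjust host, user, and database.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical tail query: reading 2 seconds in the past means the read
	// does not conflict with (and abort) concurrent writing transactions.
	rows, err := db.Query(
		`SELECT id, payload FROM events AS OF SYSTEM TIME '-2s' ORDER BY id DESC LIMIT 100`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var payload string
		if err := rows.Scan(&id, &payload); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, payload)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```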
Thanks @nvanbenschoten. Any thoughts on the query below?
I am trying to see what workarounds we can use to get around this problem.
I don't think that would be a large contributor to this problem, other than that the more indexes you have, the more keys SQL writes need to touch. If SQL writes need to write to more keys, then they will naturally be a little slower, which means there's a higher chance that they will be aborted by concurrent reads.
@nvanbenschoten @tschottdorf - Will snapshot isolation help in this case, since we are not worried about new rows written after the read transaction starts?
25355: storage: introduce metrics for intent resolutions r=nvanbenschoten a=tschottdorf In particular, this tracks poisoning aborts, which create abort span entries (see #25233). Some refactoring to make it easier to add such counters in the future was carried out. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Downgrading to S-3, though I still think that #27055 should be prioritized.
This problem appears well under control with the workaround, and I suspect that #29128 would fix the root cause of the abort span growing so large.
We've seen cases where a split request can hit a "command is too large" error. For instance, this is what happened in https://forum.cockroachlabs.com/t/trying-to-split-same-key-range/1592, where splits continuously failed with the following error:
We should figure out how this is possible and if there's anything that needs fixing. Off the top of my head, I'm not even sure exactly what factors result in command growth for splits.
Here is the full set of relevant logs from a single split attempt: