storage: splits can hit command is too large error #25233

Closed
nvanbenschoten opened this issue May 2, 2018 · 23 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

@nvanbenschoten
Member

We've seen cases where a split request can hit a command is too large error. For instance, this is what happened in https://forum.cockroachlabs.com/t/trying-to-split-same-key-range/1592, where splits continuously failed with the following error:

I180501 02:24:15.005467 26305083 internal/client/db.go:556 [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] failed batch: command is too large: 87076188 bytes (max: 67108864)

We should figure out how this is possible and if there's anything that needs fixing. Off the top of my head, I'm not even sure exactly what factors result in command growth for splits.

Here is the full set of relevant logs from a single split attempt:

Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r42877]
I180501 02:24:10.652441 26305083 storage/log.go:54  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] Range Event: "split", range: 186, info: updated_desc:<range_id:186 start_key:"\276\213\370\017l&\024\371Z\343*i\371\005{\317\000\211\022_name\000\001\375Rc\230\334\264g+e\373\001c\006z,W" end_key:"\276\214\024\204\245\031\370\037\204\365\374\205\377\370\017l\027\367\002]\022_activity\000\001\371\006\323L]\373\001c\020\301e\240" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:3 store_id:3 replica_id:2 > replicas:<node_id:2 store_id:2 replica_id:3 > next_replica_id:4 > new_desc:<range_id:42877 start_key:"\276\214\024\204\245\031\370\037\204\365\374\205\377\370\017l\027\367\002]\022_activity\000\001\371\006\323L]\373\001c\020\301e\240" end_key:"\276\214\024\204\245\034\313\373\204\314a\343\177\370\017l%\211\022_canonical\000\001\375_k[-\367P\244!\373\001c\006\225\"}" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:3 store_id:3 replica_id:2 > replicas:<node_id:2 store_id:2 replica_id:3 > next_replica_id:4 > 
I180501 02:24:10.652544 26305083 sql/planner.go:338  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal query: 
INSERT INTO system.rangelog (
  timestamp, "rangeID", "storeID", "eventType", "otherRangeID", info
)
VALUES(
  $1, $2, $3, $4, $5, $6
)
I180501 02:24:10.652586 26305083 sql/planner.go:340  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] placeholders: ["2018-05-01 02:24:10.654522612 +0000 UTC" "186" "2" "split" "42877" "{\"UpdatedDesc\":{\"range_id\":186,\"start_key\":\"vov4D2wmFPla4ypp+QV7zwCJEl9uYW1lAAH9UmOY3LRnK2X7AWMGeixX\",\"end_key\":\"vowUhKUZ+B+E9fyF//gPbBf3Al0SX2FjdGl2aXR5AAH5BtNMXfsBYxDBZaA=\",\"replicas\":[{\"node_id\":1,\"store_id\":1,\"replica_id\":1},{\"node_id\":3,\"store_id\":3,\"replica_id\":2},{\"node_id\":2,\"store_id\":2,\"replica_id\":3}],\"next_replica_id\":4},\"NewDesc\":{\"range_id\":42877,\"start_key\":\"vowUhKUZ+B+E9fyF//gPbBf3Al0SX2FjdGl2aXR5AAH5BtNMXfsBYxDBZaA=\",\"end_key\":\"vowUhKUcy/uEzGHjf/gPbCWJEl9jYW5vbmljYWwAAf1fa1st91CkIfsBYwaVIn0=\",\"replicas\":[{\"node_id\":1,\"store_id\":1,\"replica_id\":1},{\"node_id\":3,\"store_id\":3,\"replica_id\":2},{\"node_id\":2,\"store_id\":2,\"replica_id\":3}],\"next_replica_id\":4}}"]
I180501 02:24:10.652766 26305083 sql/table.go:220  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] planner acquiring lease on table 'system.system.rangelog'
I180501 02:24:10.652794 26305083 sql/table.go:220  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] planner acquiring lease on table 'system.public.rangelog'
I180501 02:24:10.655802 26305083 sql/plan.go:338  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] statement INSERT INTO system.public.rangelog("timestamp", "rangeID", "storeID", "eventType", "otherRangeID", info) VALUES ($1, $2, $3, $4, $5, $6) compiled to:
0 insert  () 
0 .into rangelog(timestamp, rangeID, storeID, eventType, otherRangeID, info, uniqueID) () 
0 .default 0 (NULL)[unknown] () 
0 .default 1 (NULL)[unknown] () 
0 .default 2 (NULL)[unknown] () 
0 .default 3 (NULL)[unknown] () 
0 .default 4 (NULL)[unknown] () 
0 .default 5 (NULL)[unknown] () 
0 .default 6 (unique_rowid())[int] () 
1 values  () 
1 .size 7 columns, 1 row () 
1 .row 0, expr 0 ($1)[timestamp] () 
1 .row 0, expr 1 ($2)[int] () 
1 .row 0, expr 2 ($3)[int] () 
1 .row 0, expr 3 ($4)[string] () 
1 .row 0, expr 4 ($5)[int] () 
1 .row 0, expr 5 ($6)[string] () 
1 .row 0, expr 6 (unique_rowid())[int] () 
I180501 02:24:10.655836 26305083 util/mon/bytes_usage.go:624  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal-txn: requesting 10240 bytes from the pool
I180501 02:24:10.655853 26305083 util/mon/bytes_usage.go:585  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal-root: now at 10240 bytes (+10240) - util/mon.(*BoundAccount).Grow:519,util/mon.(*BytesMonitor).increaseBudget:627,util/mon.(*BytesMonitor).reserveBytes:558,util/mon.(*BoundAccount).Grow:519,sql/sqlbase.(*RowContainer).AddRow:272
I180501 02:24:10.655867 26305083 util/mon/bytes_usage.go:585  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal-txn: now at 10240 bytes (+10240) - util/mon.(*BoundAccount).Grow:519,sql/sqlbase.(*RowContainer).AddRow:272,sql.(*valuesNode).startExec:166,sql.startExec.func2:510,sql.(*planVisitor).visit.func1:90
I180501 02:24:10.661211 26305083 sql/exec_log.go:173  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal-exec "" {} "INSERT INTO system.public.rangelog(\"timestamp\", \"rangeID\", \"storeID\", \"eventType\", \"otherRangeID\", info) VALUES ($1, $2, $3, $4, $5, $6)" {$1:"'2018-05-01 02:24:10.654523+00:00'", $2:"186", $3:"2", $4:"'split'", $5:"42877", $6:"'{\"UpdatedDesc\":{\"range_id\":186,\"start_key\":\"vov4D2wmFPla4ypp+QV7zwCJEl9uYW1lAAH9UmOY3LRnK2X7AWMGeixX\",\"end_key\":\"vowUhKUZ+B+E9fyF//gPbBf3Al0SX2FjdGl2aXR5AAH5BtNMXfsBYxDBZaA=\",\"replicas\":[{\"node_id\":1,\"store_id\":1,\"replica_id\":1},{\"node_id\":3,\"store_id\":3,\"replica_id\":2},{\"node_id\":2,\"store_id\":2,\"replica_id\":3}],\"next_replica_id\":4},\"NewDesc\":{\"range_id\":42877,\"start_key\":\"vowUhKUZ+B+E9fyF//gPbBf3Al0SX2FjdGl2aXR5AAH5BtNMXfsBYxDBZaA=\",\"end_key\":\"vowUhKUcy/uEzGHjf/gPbCWJEl9jYW5vbmljYWwAAf1fa1st91CkIfsBYwaVIn0=\",\"replicas\":[{\"node_id\":1,\"store_id\":1,\"replica_id\":1},{\"node_id\":3,\"store_id\":3,\"replica_id\":2},{\"node_id\":2,\"store_id\":2,\"replica_id\":3}],\"next_replica_id\":4}}'"} 7.346 1 ""
I180501 02:24:10.661241 26305083 util/mon/bytes_usage.go:609  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] internal-txn: now at 0 bytes (-10240) - util/mon.(*BoundAccount).Close:494,sql/sqlbase.(*RowContainer).Close:221,sql.(*valuesNode).Close:198,sql.(*insertNode).Close:308,sql.(*planTop).close:401
I180501 02:24:15.005467 26305083 internal/client/db.go:556  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] failed batch: command is too large: 87076188 bytes (max: 67108864)
I180501 02:24:15.005491 26305083 internal/client/txn.go:997  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] failed batch: command is too large: 87076188 bytes (max: 67108864)
I180501 02:24:15.005502 26305083 internal/client/txn.go:604  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] rolling back transaction
I180501 02:24:15.022413 26305083 storage/queue.go:642  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] done 4.90105918s

@nvanbenschoten nvanbenschoten added S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-kv-replication Relating to Raft, consensus, and coordination. labels May 2, 2018
@nvanbenschoten nvanbenschoten added this to the 2.1 milestone May 2, 2018
@nvanbenschoten
Member Author

@tschottdorf any idea which components of a Raft command for an EndTransaction with a SplitTrigger can grow without bound? The only things I see are the start and end keys of the new and old RangeDescriptors and the AbortSpan. Do you think it's realistic for an abort span to grow this large (~100 MB)?

@bdarnell
Contributor

bdarnell commented May 4, 2018

Yes, I think it's possible (and while it's also possible for the range descriptors to be very large if you try to index an unbounded field, we know from the logs provided here that this isn't the case). Under high-contention workloads (especially those that don't use the savepoint retry protocol), you could end up generating abort span entries at a significant rate.

What's worse is that abort span entries fall into an edge case for splitting: as replicated RangeID-local keys, they count toward the total size of the range for purposes of deciding whether it needs to split, but you can't actually split between them. If you do end up getting enough abort span entires to take up a significant fraction of the range, I think you'd see a lot of split loops as you try to split the regular keys of the range and end up with new ranges that are still oversized because of their abort spans.

@nvanbenschoten
Member Author

especially those that don't use the savepoint retry protocol

Is this because we'll be abandoning more transactions instead of seeing them through to completion at higher epochs?

If you do end up getting enough abort span entries to take up a significant fraction of the range, I think you'd see a lot of split loops as you try to split the regular keys of the range and end up with new ranges that are still oversized because of their abort spans.

So is there a good solution here? Just reduce your GC TTL to a point where the number of abort span entries is kept in check?

@bdarnell
Contributor

bdarnell commented May 4, 2018

Is this because we'll be abandoning more transactions instead of seeing them through to completion at higher epochs?

Yes (I think; haven't verified this)

So is there a good solution here? Just reduce your GC TTL to a point where the number of abort span entries is kept in check?

Unfortunately the GC TTL won't help here. The TTL for abort span entries is hard-coded to one hour. We need some sort of backpressure when there are too many abort span entries (and maybe also some way to write fewer of them or purge them early).

@tbg
Member

tbg commented May 5, 2018

It's unexpected that any workload would leak abort spans at a significant rate, though it's not something that we've looked at and so I wouldn't be extremely surprised. We could at least add a metric that tells you how many abort spans are written vs cleared and we should test it for various workloads.

I agree with Ben's analysis. Once a range is in that state, it'll stay there until the abort span has been GC'ed. This is also something we should roachtest.

Unfortunately the GC TTL won't help here.

It will help somewhat because it might activate GC earlier (i.e. after an hour as opposed to maybe never) but yeah, it won't help reliably in any sense.

My best stab at a fix is this:

  • find a way to tell that the abort span is too large without relying on a GC run. (Some of the discussion here suggests that abort span GC should happen on its own queue and that we need metrics for the size of the span). For starters, having a split fail because of the too-large abort span could mark the range for a forced GC run.
  • change the semantics of TxnSpanGCThreshold. Right now, it only prevents BeginTransaction, but I'm not sure why the check is only there. You'd want it for any transactional command; what it does now seems insufficient.
  • In the GC queue, remove everything older than, say, 20s. This interrupts transactions that are still running at that timestamp, which is unfortunate. We can get fancier and compute a timestamp that leaves at most X mb of abort span around, which is a little more work (because the data isn't laid out in a way that makes this fun).

@debraj-manna

debraj-manna commented May 5, 2018

@nvanbenschoten I could not reopen #25151, so I'm commenting here. The performance has degraded again. I have observed the same behavior each time: whenever I increase the command size as you suggested, performance improves for a few hours and then starts degrading again, as you can see in the image below.

[screenshot: performance graph, 2018-05-05 9:49 AM]

I am seeing the "initiating a split of this range" message in the logs.

I180505 04:17:58.407243 457585397 storage/replica_command.go:862  [split,n2,s2,r199/5:/Table/54/{3/10101…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-08-30T00:11:56.046999999Z/1010031/1/"_canonical"/6600707803436260843/1525473647534 [r38837]

The issue is very easily reproducible with our schema and cluster. Whenever I create a new cluster and put load on it, it works fine for some time and then its performance starts degrading.

I have emailed you the debug logs again.
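For reference, the "increase the command size" workaround referred to at the top of this comment is presumably the kv.raft.command.max_size cluster setting (the 67108864-byte maximum in the error above is the 64 MiB default). A minimal sketch, assuming that setting name and a SQL shell connected to the cluster:

```sql
-- Raise the maximum Raft command size from the 64 MiB default to 128 MiB.
-- This only buys headroom; it does not address the underlying abort span growth.
SET CLUSTER SETTING kv.raft.command.max_size = 134217728;

-- Confirm the new value.
SHOW CLUSTER SETTING kv.raft.command.max_size;
```

As the rest of this thread shows, raising the limit only postpones the problem while the abort span keeps growing.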

@tbg
Member

tbg commented May 5, 2018

@debraj-manna interesting that you can reproduce this easily. We have discussed ways of addressing this problem above in this issue, but we're not sure how you're getting into this situation in the first place.

To help us solve this, would you mind sharing the schema and load generation with us (you could email this if you don't want it to be visible publicly)? That will be immensely helpful in addressing this issue.

@debraj-manna

debraj-manna commented May 5, 2018

@tschottdorf - I have emailed you the debug logs [email protected]

Let me know if you have received it. The load generation is a little difficult to explain, as we are generating it via our application.

We have a 5-node cluster in AWS of m4.4xlarge instance type on Ubuntu 14. As I posted in a previous comment, the load on the cluster is about 2000 queries/sec, of which around 30% is on the config_str table, with a mixture of INSERT/UPDATE/SELECT.

@tbg
Member

tbg commented May 5, 2018

Thanks @debraj-manna. The logs actually don't contain the schema, so I'd need that separately. If you have a few sample INSERT and UPDATE statements (with example data) and the dump of select * from crdb_internal.node_statement_statistics; after a node has been processing traffic for a bit, that will be helpful.
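For readers following along, both pieces of information being requested can be gathered from a SQL shell; a minimal sketch, assuming the table under load is the config_str table named in the previous comment:

```sql
-- Schema of the table carrying most of the load.
SHOW CREATE TABLE config_str;

-- Per-statement statistics recorded by the node; run this after the node
-- has been processing traffic for a while, as requested above.
SELECT * FROM crdb_internal.node_statement_statistics;
```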

@tbg tbg added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting and removed S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels May 5, 2018
@debraj-manna

debraj-manna commented May 5, 2018

@tschottdorf - I have emailed you sample queries along with the output of select * from crdb_internal.node_statement_statistics;

I have one more question: could this issue be caused by read-write conflicts?

  • We are continually appending to the config_str table.
  • We are then doing a “tail” on it by calling getLastVersionsSince ..

I think both of these queries touch the same rows, and the transactions are getting aborted and retried.

If this is the case, is there a workaround? We are okay with not seeing new writes made after the query started; in other words, we are okay with read queries not counting towards transaction conflicts.

@debraj-manna

@tschottdorf @nvanbenschoten - So in config_str I have two indexes of the type below, differing only in the sort order of the first indexed column:

INDEX uts_cid_ot (updatetimestamp, customerid, objecttype),
INDEX uts_desc_cid_ot (updatetimestamp desc, customerid, objecttype),

Do you think this could contribute to this problem?

@debraj-manna

Did you observe anything unusual from the logs & schemas?

tbg added a commit to tbg/cockroach that referenced this issue May 8, 2018
In particular, this tracks poisoning aborts, which create abort span
entries (see cockroachdb#25233).

Some refactoring to make it easier to add such counters in the future
was carried out.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 8, 2018
See cockroachdb#25233. The abort span (or transaction span) can become arbitrarily
large, and yet its GC is not driven by a metric. This makes it possible
for a range to become highly GC'able without having it be picked up by
GC.

Even worse, if the abort span is too large we won't be able to split the
range any more; a split *copies* the abort span into the right hand side
(though in theory it could copy only a relevant part and remove these
items from the LHS). This means that such a range can cause stability
problems.

Address this problem in a somewhat inelegant but hopefully workable way:
The SysBytes of the range queues the replica for "aggressive" GC when it
exceeds a set fraction of the max raft command size. In "aggressive"
mode, everything in the local key span older than a minute is cleared
out (in regular operation it's one hour). Doing this *should* interrupt
long-running transactions (though in theory it could let them refresh
their spans) though due to deficiencies in how the TxnGCSpanThreshold is
handled in practice it won't (this needs to be fixed separately).

Release note (bug fix): Prevent formation of large ranges in some workloads
with many aborted transactions.
@tbg
Member

tbg commented May 8, 2018

@debraj-manna Thanks for the email, I'll try to take a look tomorrow. For now, I have a PR that experiments with resolving this problem when it occurs (though it likely won't occur in the first place when we've addressed it). It's not ready for prime time though.

@debraj-manna

debraj-manna commented May 8, 2018

Thanks @tschottdorf. Also, let me know your thoughts on the question below if you manage to look at it tomorrow.

I have one more question: could this issue be caused by read-write conflicts?

We are continually appending to the config_str table.
We are then doing a “tail” on it by calling getLastVersionsSince ..
I think both of these queries touch the same rows, and the transactions are getting aborted and retried.

If this is the case, is there a workaround? We are okay with not seeing new writes made after the query started; in other words, we are okay with read queries not counting towards transaction conflicts.

@nvanbenschoten
Member Author

Could this issue be caused by read-write conflicts?

This is likely part of the issue, yes.

If this is the case, is there a workaround? We are okay with not seeing new writes made after the query started; in other words, we are okay with read queries not counting towards transaction conflicts.

@debraj-manna one way to avoid aborting write transactions when performing these "tail" queries is to perform consistent historical reads. These reads take place at a specified time in the past, and therefore do not interfere with current-time transactions. For instance, you could perform these tail queries 2 seconds in the past, which should be sufficient to avoid aborting writing txns. This is described further in our docs.
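A minimal sketch of such a historical "tail" read, assuming the config_str table and updatetimestamp column named elsewhere in this thread, and a CockroachDB version that accepts a relative offset for AS OF SYSTEM TIME (an absolute timestamp string works otherwise):

```sql
-- Run the "tail" query as a consistent historical read 2 seconds in the past.
-- Because the read happens at a fixed timestamp behind current time, it does
-- not conflict with (and cannot abort) in-flight writing transactions.
SELECT *
FROM config_str AS OF SYSTEM TIME '-2s'
WHERE updatetimestamp > '2018-05-08 00:00:00'  -- placeholder for the application's last-seen timestamp
ORDER BY updatetimestamp;
```

The tradeoff is that the query will not see writes committed in the last two seconds, which matches the requirement stated above.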

@tschottdorf from the query patterns, this sounds like a queue-type workload. I'm curious if we will see issues like this when we create #24236.

@debraj-manna

debraj-manna commented May 8, 2018

Thanks @nvanbenschoten. Any thoughts on the question below?

So in config_str I have two indexes of the type below, differing only in the sort order of the first indexed column:

INDEX uts_cid_ot (updatetimestamp, customerid, objecttype),
INDEX uts_desc_cid_ot (updatetimestamp desc, customerid, objecttype)

Do you think this could contribute to this problem?

I am trying to see what workarounds we can use to get around this problem.

@nvanbenschoten
Member Author

Do you think this could contribute to this problem?

I don't think that would be a large contributor to this problem, other than that the more indexes you have, the more keys each SQL write needs to touch. If SQL writes need to write more keys, they will naturally be a little slower, which means there's a higher chance that they will be aborted by concurrent reads.

@debraj-manna

@nvanbenschoten @tschottdorf - Will snapshot isolation help in this case, since we are not worried about new rows written after the read transaction starts?

@tbg
Member

tbg commented Jun 25, 2018

Based on acute need (#26830), I've revisited this and merged a (likely) mitigation that will be backported to 2.0.x (#26934).

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 22, 2018
tbg added a commit to tbg/cockroach that referenced this issue Aug 16, 2018
In particular, this tracks poisoning aborts, which create abort span
entries (see cockroachdb#25233).

Some refactoring to make it easier to add such counters in the future
was carried out.

Release note: None
craig bot pushed a commit that referenced this issue Aug 16, 2018
25355: storage: introduce metrics for intent resolutions r=nvanbenschoten a=tschottdorf

In particular, this tracks poisoning aborts, which create abort span
entries (see #25233).

Some refactoring to make it easier to add such counters in the future
was carried out.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Aug 16, 2018
@tbg
Member

tbg commented Aug 16, 2018

Downgrading to S-3, though I still think that #27055 should be prioritized.

@tbg
Member

tbg commented Oct 11, 2018

This problem appears well under control with the workaround, and I suspect that #29128 would fix the root cause of the abort span growing so large.
