-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize: add a EnsureSafeSplitKey in Decider #42056
Conversation
Commit: It is found that the split and read-write are separated when the transaction is concurrent. For example, the data key is /Table/68/1/10. The endkey boundary of the range is exactly /Table/68/1/10/0. The previous range does not contain /Table/68/1/10/0, the next range is included. But the scan request sends startKey /Table/68/1/10, endKey /Table/68/1/10/0 in the previous range. The key of the put request is /Table/68/1/10/0. In the latter range, the read and write is separated, so it is optimized in the decider.go. In addition to optimizing the split, the test found that the panic of #41630 has not been triggered so far, so guessing that this modification may also fix the problem described by #39794 #36834 #36356. Fixes #39794 #36834 #36356 Release note (bug fix): Modified code in #42056 submission.
@petermattis, mind taking a look at this? I don't quite understand the description, but the code suggests that load-based splits may return a split key that falls in the middle of a SQL row. As far as I can tell, we want to call cockroach/pkg/storage/client_split_test.go Lines 280 to 282 in 8b9f547
I'm also unsure why we don't allow rows to be split between columns. I get that we want to avoid it, but is this fixing a serious bug or just a silly thing that can happen? I found no answers in cockroach/pkg/storage/engine/mvcc.go Lines 2995 to 2998 in 35bdbf0
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what badness will happen from splitting inside a row. We do call keys.EnsureSafeSplitKey
on AdminSplit
requests when no split-point is provided (see MVCCFindSplitKey
). Rather than making a change inside the split decider, I'd do this enforcement in Replica.adminSplitWithDescriptor
.
@danhhz Do you recall if there are correctness problems from splitting in the middle of row? This can only happen when there are multiple column families (I think).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)
Indeed it could only happen if there are multiple column families in a row, but note that (edit: most) system tables (even in new clusters) still have 1 column family per row for backward compatibility. I recall at some point that SQL depended on this, but I've long forgotten where that was and it's reasonably likely that's changed anyway. No idea who else to tag in here... @jordanlewis do you know who is most up to date on how sql execution uses column families? Also cc @dt does bulk io care if a row is split between two ranges? |
Column families? I'll continue this tag-fest by saying that that's a @rohany beacon if I ever did hear one! |
BulkIO counts rows by checking when I'm not sure if there are any other places we bake that assumption in -- there easily could be. Uniqueness enforcement comes to mind: I think we rely on the fact that the default family will collide to ensure we don't inject a non-unique row, and I'm not quite sure how much we can use that assumption if the default family KV might not be in the same SST/range. |
Aside from relying on the uniqueness checks upon inserting, I don't think that SQL really cares what ranges the different families live in. The rowfetcher/cfetcher just put together the rows from the k/v family pieces as the underlying fetchers give them kv's without caring what range they come from. |
The uniqueness checks on inserting don't depend on a row being contained in a single range, right? |
woops -- I saw part of davids comment above and got confused. It doesn't depend on the range at all. |
Fetchers appeared before because of the column families problem, it is similar to this issue:#38577. we try to avoid issue 38577, when the table is built, it is a single column family. but the panic of unset EncDatum will still appear in the poc test. I just turned off splitbyload when testing, After that, this panic problem no longer appears. You can find that the normal split key does not end with /0 by “select * from crdb_internal.ranges;”command. However, when spiltbyload is turned on, there is a splitkey at the end of /0, so I think it may be related to the key of splitbyload. |
@maweike, this seems like you have a way to reproduce this at will, would you be able to share the details of your reproduction with us? From what you're saying it seems likely that the load based split is a problem, but I want to make sure we understand the implications of it all the way through the stack. @petermattis it's worth pointing out that AdminSplit initially had EnsureSafeSplitKey, but that this was deliberately undone: #16344. @bdarnell was driving this, so he should chime in. At the very least, we ought to prevent AdminSplit from splitting between rows. It should probably get the same "is this SQL" detection that But this is also a good opportunity to figure out what causes the problems of a mid-row split at the SQL layer. It seems that nobody can tell why it would be severely problematic to do so. |
Oh - and this test I was looking at cockroach/pkg/storage/client_split_test.go Line 282 in 8b9f547
actually calls |
@tbg The scenario we simulated was a commercial scenario under stress testing,and the test software used was an intranet commercial software, so I am sorry that I can't share the detailed process of reproduction, but I can confirm that the problem occurred when the data that I am reading is at the boundary of the split range, which will cause the range of reads and writes to be inconsistent, but when the data I read is inside the range, it will not cause the read and write range to be inconsistent. |
@rohany Sorry, I meant uniqueness enforcement in BulkIO and specifically IMPORT INTO's direct injection of rows, and what happens when the default family KV gets put in a different SST than the rest of the row -- that wasn't something considered when we built it. FWIW, I think it should actually be OK since we'd still get the collision later when we go to add that SST and still rollback. The row counter will still certainly be wrong though. |
I'll take a look at it today or tomorrow! |
So i have a repro of the panic --
at line 73 on current master (right after
I have no clue why this happens though. I'm not sure why you have to wait for the panic to occur -- are the range splits happening asynchronously? show ranges immediately after the split call shows multiple ranges, so I'm not sure what waiting for 30 seconds is doing. |
Ok, the cause for the panic is pretty simple -- it seems that the fetchers assume that when a range boundary is hit, all the kv's for a row have been retrieved already. They don't explicitly make this assumption, but it comes out here in
The underlying KVBatchFetcher returns false as the first argument when it hits the end of a range, which causes the fetcher to think that this row is done being processed -- a log on the above example confirms this:
Here, the first part of the row |
Aha! This is exactly what I was thinking of. I wish I would have remembered the specifics and saved you some time, but thanks for looking into it! |
IIRC there was not originally a severe issue here. We avoid splitting mid-row because it might make it harder to count rows (you can no longer to per-range counts and sum them), and it's a likely performance problem (since the column families of a single row are much more likely to be updated in the same transaction than neighboring rows. It also makes things much harder to explain; today we can say that single-row PK-only writes never need distributed transactions).
So 👍 to @maweike 's change, although I'd like to see a new/updated test that demonstrates the problem this solves.
Or if we keep AdminSplit as-is we should just delete the test. |
@bdarnell I'd rather fix this in AdminSplit if we can but it looks like you're not so optimistic that this is possible. Can we tell from looking at the key itself whether it is a column family KV pair? I assume we can't, right? If the table is really
If the int encoded at the end of the key is zero, it can't split a row and we don't need to massage it, right? Taking a step back, I'm quite unhappy about the state of affairs here. We need to uphold some unclear invariant in KV and have no real way of doing so. Any new use of It does seem that this PR is actually the reasonable thing to do here in the short term, though I'd like to explore what can be done here more generally. Not seeing many options, though. |
I brushed up this PR over in
#42833.
|
I'm neither optimistic nor pessimistic here - I don't have any of the relevant encoding details paged in. I agree that we should be able to statically construct an appropriate split key.
Right, this is the problem.
That sounds right. So appending a zero instead of removing a component should result in a "safe" split key. There's a possible risk of unhelpfully producing splits at both K and |
We could end up splitting between column families of the same row, which is illegal. Unfortunately the KV layer has to uphold invariants here that it doesn't quite have introspection into, but after this commit it hopefully stops breaking them. See cockroachdb#16344 for some additional history. Possibly the solution for cockroachdb#39794. Possibly the solution for cockroachdb#36834. Possibly the solution for cockroachdb#36356. (Intentionally not closing the above; leaving that to the SQL folks). Closes cockroachdb#42056 (which is the go-to for reading up on this issue). Release note (bug fix): prevent a number of panics from the SQL layer caused by an invalid range split. These would usually manifest with messages mentioning encoding errors ("found null on not null column" but also possibly various others).
42833: storage: call EnsureSafeSplitKey during load-based splits r=bdarnell a=tbg We could end up splitting between column families of the same row, which is illegal. Unfortunately the KV layer has to uphold invariants here that it doesn't quite have introspection into, but after this commit it hopefully stops breaking them. See #16344 for some additional history. Possibly the solution for #39794. Possibly the solution for #36834. Possibly the solution for #36356. (Intentionally not closing the above; leaving that to the SQL folks). Closes #42056 (which is the go-to for reading up on this issue). Release note (bug fix): prevent a number of panics from the SQL layer caused by an invalid range split. These would usually manifest with messages mentioning encoding errors ("found null on not null column" but also possibly various others). Co-authored-by: Tobias Schottdorf <[email protected]>
We could end up splitting between column families of the same row, which is illegal. Unfortunately the KV layer has to uphold invariants here that it doesn't quite have introspection into, but after this commit it hopefully stops breaking them. See cockroachdb#16344 for some additional history. Possibly the solution for cockroachdb#39794. Possibly the solution for cockroachdb#36834. Possibly the solution for cockroachdb#36356. (Intentionally not closing the above; leaving that to the SQL folks). Closes cockroachdb#42056 (which is the go-to for reading up on this issue). Release note (bug fix): prevent a number of panics from the SQL layer caused by an invalid range split. These would usually manifest with messages mentioning encoding errors ("found null on not null column" but also possibly various others).
For example, the data key is /Table/68/1/10. The endkey boundary of the range is exactly /Table/68/1/10/0. The previous range does not contain /Table/68/1/10/0, the latter one. Range contains. But the scan request sends startKey /Table/68/1/10, endKey /Table/68/1/10/0 in the previous range. The key of the put request is /Table/68/1/10/0. In the next range, the read and write is separated.
Release note: None