kvserver: v20.1.0-beta.3: received ... results, limit was ... #46652
Comments
cc @nvanbenschoten @andreimatei not sure how concerning this is |
Most concerning, I'd say, since the sentry report already has two events from distinct clusters. Both events show exactly the same query fingerprint; I don't know what to make of that. I'll add it to the release blockers. |
Is there any way to see whether Pebble was perhaps used here? That the fingerprint looks the same would suggest to me that this is the same user (perhaps having recreated the cluster). I took a look at the batch evaluation code but nothing jumped out at me, so I started looking into whether there's maybe a problem with the Pebble MVCC scanner. I didn't find anything there either, but it occurred to me that we ought to be able to tell from a sentry report whether Pebble was active. @petermattis that seems to be something storage should look to get into the release. I believe that's achieved by massaging cockroach/pkg/util/errorutil/error.go Lines 48 to 49 in c097a16
|
The other far-fetched possibility I considered is a data race, where DistSender changes MaxSpanRequestKeys when resending a request (note how the stack is purely local, in both events). But this can't be it, we pass the field by value through several layers of DistSender (and besides, don't parallelize limited requests) |
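To illustrate why passing the field by value rules this out, here is a minimal sketch. The `header` type and `sendPartial` function are hypothetical stand-ins, not the actual DistSender code; only the field name `MaxSpanRequestKeys` comes from the discussion above.

```go
package main

import "fmt"

// header is a hypothetical stand-in for a batch header.
// Only the limit field matters for this illustration.
type header struct {
	MaxSpanRequestKeys int64
}

// sendPartial mimics a layer that lowers the limit before a resend.
// It receives the header by value, so it only mutates its own copy.
func sendPartial(h header, alreadyReturned int64) int64 {
	h.MaxSpanRequestKeys -= alreadyReturned
	return h.MaxSpanRequestKeys
}

func main() {
	orig := header{MaxSpanRequestKeys: 3}
	remaining := sendPartial(orig, 2)
	// The caller's limit is unchanged after the call.
	fmt.Println(remaining, orig.MaxSpanRequestKeys)
}
```

Since each layer works on its own copy, a concurrent resend could not change the limit seen by an in-flight evaluation in this model.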
Downgrade the assertion to returning an error to the user, asking nicely for providing their repro on the issue. We keep reporting to sentry despite not terminating the process. Touches cockroachdb#46652. Release justification: low-risk reporting improvement Release note: None
I looked some more and didn't find anything. It's likely that I did break something (or expose something that was previously not exercised). PR #46720 downgrades the assertion (but keeps reporting) to, hopefully, resolve this soon with external input. |
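For readers following along, a downgraded assertion of this kind could look roughly like the sketch below. It is not the code from the PR: `checkLimit`, `report`, and `reported` are hypothetical names, and the real change reports to sentry rather than to an in-memory slice.

```go
package main

import (
	"errors"
	"fmt"
)

// reported collects messages; it is a hypothetical stand-in for crash reporting.
var reported []string

func report(msg string) { reported = append(reported, msg) }

// checkLimit returns an error instead of panicking when more results
// came back than the limit allows. A limit <= 0 is treated as "no limit".
func checkLimit(received, limit int64) error {
	if limit <= 0 || received <= limit {
		return nil
	}
	msg := fmt.Sprintf("received %d results, limit was %d", received, limit)
	report(msg) // keep reporting even though the process keeps running
	return errors.New(msg + "; please share a repro on the tracking issue")
}

func main() {
	fmt.Println(checkLimit(3, 3))
	fmt.Println(checkLimit(7, 3))
}
```

The point of the design is that the violation is still visible in reports while the user gets an error that points them at the issue.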
We do have telemetry about Pebble usage. I like the idea of including it in the sentry report too. |
46720: kvserver: improve reporting for an assertion r=andreimatei a=tbg Downgrade the assertion to returning an error to the user, asking nicely for providing their repro on the issue. We keep reporting to sentry despite not terminating the process. Touches #46652. Release justification: low-risk reporting improvement Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Now that #46976 has landed, do we feel comfortable removing it from the release blocker list? Or is there more to do here before the release? |
Judging by the report, it seems likely that this is a bug that users will hit "quickly". If we had randomized testing for read-only batches with limits and it failed to reproduce, I would be comfortable releasing. As is, I worry that the moment 20.1 gets out there someone will hit it and we'll have to rush in. Going to look at randomized testing tomorrow. It shouldn't be too hard to write something quick and dirty that can tell us more. I will also inspect the code paths again. Do you think kvnemesis can be used for this easily? |
If I recall correctly, kvnemesis doesn't currently support ranged operations, so there would be some work to do. Adding such support shouldn't be too difficult though, and it does seem like the right tool for the job. |
Hack up just enough support for scans to see if we can perhaps tickle the release blocker here: cockroachdb#46652 No validation was added. We're just using kvnemesis to send random scans for us. Release note: None
status update: asked @nvanbenschoten to do a code audit pass to see if he can spot a bug in my refactors. If that fails to turn up anything I suggest we release. |
I don't have anything new to report here, but I did just find #40958, which may be relevant. |
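Nothing is jumping out at me here. RocksDB and Pebble's implementation of MVCCScan both look good. So does evaluateBatch. My only suggestions are that you:
1. adapt #47120 to ReverseScan
2. hack around https://github.com/cockroachdb/cockroach/pull/47120/files#diff-e04ef583219215b6b7c19c4312e683dcR171 to assign a limit to batches
We know from the stacktrace that the request was a BatchRequest with a MaxSpanRequestKeys (and TargetBytes) consisting of either all ScanRequests or all ReverseScanRequests, so we should test all of those cases. |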
Oh, good points. I'll look into it tomorrow
…On Tue, Apr 7, 2020, 20:32 Nathan VanBenschoten ***@***.***> wrote:
Nothing is jumping out at me here. RocksDB and Pebble's implementation of
MVCCScan both look good. So does evaluateBatch.
My only suggestions are that you:
1. adapt #47120 <#47120>
to ReverseScan
2. hack around
https://github.com/cockroachdb/cockroach/pull/47120/files#diff-e04ef583219215b6b7c19c4312e683dcR171
to assign a limit to batches
We know from the stacktrace that the request was a BatchRequest with a
MaxSpanRequestKeys (and TargetBytes) consisting of either all
ScanRequests or all ReverseScanRequests, so we should test all of those
cases.
|
Hmm, in #40958 the user sort of claims that they were running with faulty memory? I don't trust that very much, but it's good to see that we saw this issue well before I made changes in this area. This is further evidence that it should not block the release. Will look into your suggestions now. |
I found more real bugs after putting in your suggestions, see #47120, though no repro for this one. |
Not really for this issue (here we're looking at a read-only batch) but what RevertRange does is super suspicious because that command isn't even MVCC aware, and doesn't want the usual semantics. It will happily go through the whole keyspace range by range until it hits its limit on one of the ranges. Then, once it's hit the limit, subsequent requests in the same batch still clear the first key (above this code). cockroach/pkg/kv/kvserver/batcheval/cmd_revert_range.go Lines 104 to 116 in 6c7b80e
Lines 81 to 90 in f176a39
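As a rough sketch of the semantics one would expect instead, consider a batch whose requests share a single key budget, where a request that arrives after the budget is spent does nothing at all. The `span` type and `evalBatch` function are hypothetical and assume a positive starting budget; this is not the batcheval code.

```go
package main

import "fmt"

// span is a hypothetical request covering a list of keys.
type span struct{ keys []string }

// evalBatch processes spans in order under one shared, positive key budget.
// Once the budget is spent, later spans are skipped entirely.
func evalBatch(spans []span, budget int) [][]string {
	out := make([][]string, len(spans))
	for i, s := range spans {
		if budget == 0 {
			continue // exhausted: do not touch even the first key
		}
		n := len(s.keys)
		if n > budget {
			n = budget
		}
		out[i] = s.keys[:n]
		budget -= n
	}
	return out
}

func main() {
	res := evalBatch([]span{
		{keys: []string{"a", "b", "c"}},
		{keys: []string{"d", "e"}},
	}, 3)
	fmt.Println(len(res[0]), len(res[1]))
}
```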
|
Looks like RC2 is still affected: #48121 |
Oh nice. Ok, here's the rub: using rocksdb, and
We received 7 results, wanted 3, the original limit was also 3 (duh), and the batch was just a single Very surprising. The report was triggered by a long SELECT statement. |
There are a dozen or so repros from this, all on young clusters (start-single-node), so it's likely one person/entity is able to hit this bug and they just do it over and over. The details are always either the one above or this (which is the same, just with different numbers):
The statements triggering this are:
The second one is also what triggered it in the first report in this thread. |
I must say this is weird. Single forward scans are the least likely to hit any kind of bug here. The three clusters that generated all the reports have had no involvement in any other sentry reports (i.e. they don't have busted disks or something like that). I hope the error, which links to this issue, motivates the person running those queries to come forward and file a report here. |
Looked at the code again and no idea how this would happen except for a bug at the MVCC layer or below. I was hoping that somehow, magically this could be caused by the internal read retry here but after injecting an artificial retry there it didn't pan out (just as reading the code suggested it would not). |
@itsbilal seems to have figured it out (or at least found a plausible bug). 🙇♂️ and handing over. |
We weren't checking for MaxKeys (or TargetBytes) being reached in the case where we read from intent history in the MVCC scanner. All other cases go through addAndAdvance(), which had these checks. Almost certainly fixes cockroachdb#46652. Would be very surprised if it was something else. Release note (bug fix): Fixes a bug where a read operation in a transaction would error out for exceeding the maximum count of results returned.
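The shape of this bug can be shown with a toy scanner. This is a simplified model under stated assumptions, not the real MVCC scanner: `scanner`, `entry`, `add`, and `run` are hypothetical, and the `checkHistory` flag exists only to toggle between the buggy and fixed behavior.

```go
package main

import "fmt"

// scanner is a toy stand-in for a scanner with a key limit.
type scanner struct {
	maxKeys int
	results []string
}

// add appends a result and reports whether the scan may continue.
func (s *scanner) add(k string) bool {
	s.results = append(s.results, k)
	return s.maxKeys <= 0 || len(s.results) < s.maxKeys
}

// entry is a hypothetical key; fromHistory marks values that are
// served from the intent history path.
type entry struct {
	key         string
	fromHistory bool
}

// run scans all entries. With checkHistory=false the history path
// drops the "limit reached" signal, mimicking the described bug.
func run(entries []entry, maxKeys int, checkHistory bool) int {
	s := &scanner{maxKeys: maxKeys}
	for _, e := range entries {
		more := s.add(e.key)
		if e.fromHistory && !checkHistory {
			continue // bug: limit signal ignored on this path
		}
		if !more {
			break
		}
	}
	return len(s.results)
}

func main() {
	es := make([]entry, 7)
	for i := range es {
		es[i] = entry{key: fmt.Sprintf("k%d", i), fromHistory: true}
	}
	fmt.Println(run(es, 3, false), run(es, 3, true)) // prints "7 3"
}
```

With a limit of 3 and seven keys served from history, the unchecked path returns all seven, which has the same shape as the "received 7 results, wanted 3" report earlier in the thread.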
46992: sql: Add Logical Column ID field to ColumnDescriptor r=rohany a=RichardJCai The LogicalColumnID field mimics the ColumnID field; however, LogicalColumnID may be swapped between two columns whereas ColumnID cannot. LogicalColumnID is referenced for virtual tables (pg_catalog, information_schema) and most notably affects column ordering for SHOW COLUMNS. This LogicalColumnID field supports swapping the order of two columns - currently only used for ALTER COLUMN TYPE when a shadow column is created and swapped with its original column. Does not affect existing behaviour. Release note: None
47449: cli: add --cert-principal-map to client commands r=petermattis a=petermattis Add support for the `--cert-principal-map` flag to the certs and client commands. Anywhere we were accepting the `--certs-dir` flag, we now also accept the `--cert-principal-map` flag. Fixes #47300 Release note (cli change): Support the `--cert-principal-map` flag in the `cert *` and "client" commands such as `sql`.
48138: keys: support splitting Ranges on tenant-id prefixed keys r=nvanbenschoten a=nvanbenschoten Fixes #48122. Relates to #47903. Relates to #48123. This PR contains a series of small commits that work towards the introduction of tenant-id prefixed keyspaces and begin the removal of some `keys.TODOSQLCodec` instances. This should be the only time we need to touch C++ throughout this work.
48160: storage,libroach: Check for MaxKeys when reading from intent history r=itsbilal a=itsbilal We weren't checking for MaxKeys (or TargetBytes) being reached in the case where we read from intent history in the MVCC scanner. All other cases go through addAndAdvance(), which had these checks. Almost certainly fixes #46652. Would be very surprised if it was something else. Release note (bug fix): Fixes a bug where a read operation in a transaction would error out for exceeding the maximum count of results returned.
48162: opt: add rule to eliminate Exists when input has zero rows r=rytaft a=rytaft This commit adds a new rule, `EliminateExistsZeroRows`, which converts an `Exists` subquery to False when it's known that the input produces zero rows. Informs #47058 Release note (performance improvement): The optimizer can now detect when an Exists subquery can be eliminated because the input has zero rows. This leads to better plans in some cases.
Co-authored-by: richardjcai <[email protected]> Co-authored-by: Peter Mattis <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Bilal Akhtar <[email protected]> Co-authored-by: Rebecca Taft <[email protected]>
Backport of cockroachdb#48160. We weren't checking for MaxKeys (or TargetBytes) being reached in the case where we read from intent history in the MVCC scanner. All other cases go through addAndAdvance(), which had these checks. Almost certainly fixes cockroachdb#46652. Would be very surprised if it was something else. Release note (bug fix): Fixes a bug where a read operation in a transaction would error out for exceeding the maximum count of results returned.
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/organizations/cockroach-labs/issues/1583480205/?referrer=webhooks_plugin
Panic message:
Stacktrace (expand for inline code snippets):
cockroach/pkg/kv/kvserver/replica_read.go
Lines 94 to 96 in fcd74cd
cockroach/pkg/kv/kvserver/replica_read.go
Lines 66 to 68 in fcd74cd
cockroach/pkg/kv/kvserver/replica_send.go
Lines 235 to 237 in fcd74cd
cockroach/pkg/kv/kvserver/replica_send.go
Lines 97 to 99 in fcd74cd
cockroach/pkg/kv/kvserver/replica_send.go
Lines 35 to 37 in fcd74cd
cockroach/pkg/kv/kvserver/store_send.go
Lines 203 to 205 in fcd74cd
cockroach/pkg/kv/kvserver/stores.go
Lines 187 to 189 in fcd74cd
cockroach/pkg/server/node.go
Lines 924 to 926 in fcd74cd
cockroach/pkg/util/stop/stopper.go
Lines 301 to 303 in fcd74cd
cockroach/pkg/server/node.go
Lines 912 to 914 in fcd74cd
cockroach/pkg/server/node.go
Lines 950 to 952 in fcd74cd
cockroach/pkg/rpc/context.go
Lines 536 to 538 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/transport.go
Lines 198 to 200 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/transport.go
Lines 167 to 169 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1629 to 1631 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 459 to 461 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 541 to 543 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1400 to 1402 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1085 to 1087 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 736 to 738 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go
Lines 85 to 87 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go
Lines 45 to 47 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go
Lines 125 to 127 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go
Lines 224 to 226 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go
Lines 182 to 184 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
Lines 223 to 225 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go
Lines 104 to 106 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go
Lines 171 to 173 in fcd74cd
cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
Lines 503 to 505 in fcd74cd
cockroach/pkg/kv/db.go
Lines 738 to 740 in fcd74cd
cockroach/pkg/kv/txn.go
Lines 893 to 895 in fcd74cd
cockroach/pkg/sql/row/kv_batch_fetcher.go
Lines 183 to 185 in fcd74cd
cockroach/pkg/sql/row/kv_batch_fetcher.go
Lines 314 to 316 in fcd74cd
cockroach/pkg/sql/row/kv_batch_fetcher.go
Lines 398 to 400 in fcd74cd
cockroach/pkg/sql/row/kv_fetcher.go
Lines 86 to 88 in fcd74cd
cockroach/pkg/sql/row/fetcher.go
Lines 597 to 599 in fcd74cd
cockroach/pkg/sql/row/fetcher.go
Lines 587 to 589 in fcd74cd
cockroach/pkg/sql/row/fetcher.go
Lines 483 to 485 in fcd74cd
cockroach/pkg/sql/rowexec/tablereader.go
Lines 162 to 164 in fcd74cd
cockroach/pkg/sql/execinfra/processorsbase.go
Lines 747 to 749 in fcd74cd
cockroach/pkg/sql/flowinfra/flow.go
Lines 369 to 371 in fcd74cd
cockroach/pkg/sql/distsql_running.go
Lines 377 to 379 in fcd74cd
cockroach/pkg/sql/distsql_running.go
Lines 987 to 989 in fcd74cd
cockroach/pkg/sql/conn_executor_exec.go
Lines 867 to 869 in fcd74cd
cockroach/pkg/sql/conn_executor_exec.go
Lines 765 to 767 in fcd74cd
cockroach/pkg/sql/conn_executor_exec.go
Lines 470 to 472 in fcd74cd
cockroach/pkg/sql/conn_executor_exec.go
Lines 93 to 95 in fcd74cd
cockroach/pkg/sql/conn_executor.go
Lines 1393 to 1395 in fcd74cd
cockroach/pkg/sql/conn_executor.go
Lines 1322 to 1324 in fcd74cd
cockroach/pkg/sql/conn_executor.go
Lines 477 to 479 in fcd74cd
cockroach/pkg/sql/pgwire/conn.go
Lines 593 to 595 in fcd74cd
v20.1.0-beta.3
go1.13.5