-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: inverted range in intervalSkl.AddRange #32149
Comments
|
This panic is coming from here: cockroach/pkg/storage/tscache/interval_skl.go Lines 241 to 243 in 824e668
We're providing an end key to the timestamp cache that sorts before the start key. I looked through the sklImpl logic, specifically around truncating tscache keys when they are too long, but couldn't find anything suspicious looking. That means that this must be coming from cockroach/pkg/storage/replica.go Line 2208 in 824e668
Its very unlikely that we're messing up the span in the |
Never mind, I was mixing up the |
Another unsubstantiated theory: could be a bit flip in the end key, which could easily reverse the ordering. |
Closes cockroachdb#32149 as unactionable past this point. Release note: None
Yeah, it's tough to say on these. I think the best we can do is add more info to this panic (#32397) and see if it ever happens again. |
32397: storage/tscache: improve debug info in panic msg for inverted range r=nvanbenschoten a=nvanbenschoten Closes #32149 as unactionable past this point. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
This just reprod on my roachprod cluster, but sadly I was pulled to hours short of your better panic message patch. @nvanbenschoten anything I should save here before I wipe the cluster? |
Nevermind, it's not like there's any state to recover here. The relevant logs for posterity.
|
The "better panic" actually doesn't log the keys. It should do so at least to the local logs. |
Logging the keys before hitting the panic is a good idea now that we know this reproduces occasionally. I'll do that. In the meantime, I don't think there's much to do with your cluster @danhhz. It is interesting that the query had a LIMIT. That makes it a little more likely that this is an issue with ResumeSpans. |
We just saw this fail in #30886: The short lengths rule out the possibility that the key inversion is caused by |
Informs cockroachdb#32149. Release note: None
32424: storage/tscache: locally log full keys for inverted range Fatal r=nvanbenschoten a=nvanbenschoten Informs #32149. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Another repro: |
So if I'm reading this right, the keys are |
What workload is generating keys like that? Note:
The first byte of these keys differs by a single bit:
I'm not sure if that is adequate evidence of a bit flip. I'd like to see more reproductions of this. |
More repros, you say? #32446: For your convenience: package main
import (
"encoding/hex"
"fmt"
)
func main() {
b1 := []byte("J�†��")
b2 := []byte("�½‰É�")
fmt.Println(b1)
fmt.Println(b2)
}
|
34s to get another repro via a 5 node cluster in
I bet you get a repro every minute on a gceworker. |
PS the error is during a split again:
|
(I'm going to leave this to either @petermattis or @nvanbenschoten to figure out, hopefully straightforward now that we have an instant repro). These keys sure look like garbage to me. |
Awesome, now that we have a path to reproduction, this should be easy to track down. I'll take care of that. |
Ok, there's definitely something funky going on with ResumeSpans and |
Interestingly, we don't see a crazy resume scan from cockroach/pkg/storage/batcheval/cmd_reverse_scan.go Lines 64 to 69 in 04779ba
but it ends up corrupted before we apply it to the timestamp cache. We must be corrupting memory somehow. I tracked the failure back to bd57a2b, which makes a lot of sense. Now we just need to figure out what that's doing wrong. I also think this means that the original sentry crash is unrelated to the rest of these failures. I was beginning to suspect this after seeing how common the panic was becoming. |
Yeah, that does make a lot of sense. It is easy to foul up the memory handling when crossing the Cgo boundary. I'll take a close look at that commit. |
We are using |
|
Doesn't it? I thought |
Ah, I misremembered. |
Fixes cockroachdb#32149. Before this change, it was possible for `DBScanResults.resume_key` to point into memory owned by `mvccScanner`, which went out of scope after `MVCCScan` returned. This allowed for memory corruption when returning the key to Go. This change fixes this corruption by copying the memory to the `DBIterator` before returning, which should have a lifetime which exceeds that of the `DBScanResults`. Release note: None
32492: libroach: ensure correct lifetime for resume_key on reverse iteration r=nvanbenschoten a=nvanbenschoten Fixes #32149. Before this change, it was possible for `DBScanResults.resume_key` to point into memory owned by `mvccScanner`, which went out of scope after `MVCCScan` returned. This allowed for memory corruption when returning the key to Go. This change fixes this corruption by copying the memory to the `DBIterator` before returning, which should have a lifetime which exceeds that of the `DBScanResults`. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Not related to cockroachdb#32149. Before this change, we would treat a scan with limit 0 as a point lookup when updating the timestamp cache. After the change, we no longer update the timestamp cache if a scan had a limit of 0 and never looked at any keys. Release note: None
Not related to cockroachdb#32149. Before this change, we would treat a scan with limit 0 as a point lookup when updating the timestamp cache. After the change, we no longer update the timestamp cache if a scan had a limit of 0 and never looked at any keys. Release note: None
https://sentry.io/cockroach-labs/cockroachdb/issues/752525882/
stopper.go:182: *errors.errorString
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send.func1
The text was updated successfully, but these errors were encountered: