-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: Pebble block cache infinite loop #70154
Comments
Refs: cockroachdb#70154 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None
70155: ccl/backupccl: skip TestRestoreOldVersions r=irfansharif a=jbowens Refs: #70154 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Co-authored-by: Jackson Owens <[email protected]>
Stack can be found at the end of: |
Looks like its the |
The restore itself seems to succeed: The test doesn't do much after the restore except run a bunch of select statements to verify the restored state. |
It looks like the test is stuck when trying to tear the test servers down?
I see several goroutines waiting to acquire mutexes. It smells of a deadlock but I'm going to have to rope in someone from KV to help understand better. @irfansharif pinging you as L2, I'm going to try and see if this is reliably reproducible in the meantime:
|
The salient deadlock is the last one, in pebble (the rest is a consequence of it). I'm not sure what is going on there. Perhaps the fact that a shutdown is going on has to do with it, but I'm not sure, and even if so pebble shouldn't just deadlock - handing it back to storage. This might be helpful:
https://gist.github.com/e3c6bdce1940bcae6d5febbc24090237 It's only the goroutines containing "pebble" in the stack. I personally think this one here is up to no good, since it probably has a "shard" related mutex which is what everyone else seems to be blocking on. I could be wrong.
|
Thanks y'all, I'd forgotten the right dance to get the full output. |
Seems like these mutually recursive functions never hit a base case:
It's mostly repeating line 428 -> 434 -> 456 -> 428 in Seems like there's precedence for This was the pr which previously fixed a similar issue: cockroachdb/pebble@c96d043 |
There have been no recent changes within the cache implementation. We suspect this is a long dormant bug that we've so far been unable to reproduce. Removed the release-blocker tag for now. |
The title is a bit misleading. This isn't a deadlock, but an infinite loop in the cache eviction code. Nothing has changed in this code recently, so I agree with removing the release-blocker tag. |
Add hot, cold and test counts to block cache shard. Use `coldCount` to assert that `coldSize` matches expectations. This is intended to help gain more insight out of any future reproductions of the infinite loop encountered in cockroachdb/cockroach#70154
Add sizeCold as a parameter to runHandCold so that stack traces from any future cockroachdb/cockroach#70154 reproductions will include it.
I looked at the code to check if sizes are being updated correctly recently, and didn't find anything either.
There is one code path where this can happen. If someone calls So, the loop can also happen if there's a call to The reason I think the call to |
|
Add hot, cold and test counts to block cache shard. Use `countCold` to assert that `sizeCold` matches expectations. This is intended to help gain more insight out of any future reproductions of the infinite loop encountered in cockroachdb/cockroach#70154
…races Add countCold and sizeCold as parameters to runHandCold so that stack traces from any future cockroachdb/cockroach#70154 reproductions will include their values.
Add hot, cold and test counts to block cache shard. Use `countCold` to assert that `sizeCold` matches expectations. This is intended to help gain more insight out of any future reproductions of the infinite loop encountered in cockroachdb/cockroach#70154
…races Add countCold and sizeCold as parameters to runHandCold so that stack traces from any future cockroachdb/cockroach#70154 reproductions will include their values.
Another reproduction, on the release-21.2 branch. Unfortunately, the release-21.2 branch doesn't have the debugging context added in cockroachdb/pebble#1284.
|
@bananabrick @jbowens thoughts on closing vs. keeping this open? still reproducible? |
Side note: I'm going to attempt to unskip the skipped test that was linked upthread. edit: false alarm - it was unskipped in #70501. |
Closing this out. We haven't observed it in many, many months. |
Here's a couple of recent failures:
https://teamcity.cockroachdb.com//viewLog.html?buildId=3434287&buildTypeId=Cockroach_MergedExtendedCi
https://teamcity.cockroachdb.com//viewLog.html?buildId=3432973&buildTypeId=Cockroach_MergedExtendedCi
I didn't notice anything obvious in the test log.
Jira issue: CRDB-9964
The text was updated successfully, but these errors were encountered: