CI Failure (on dedicated nodes): ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True
TimeoutError
#8289
https://buildkite.com/redpanda/vtools/builds/5492#0185ed20-8fbc-492a-b902-3174d5125678
Just noting that one notable difference between the runs posted and a green run in CDT is that "Produced at unexpected offset" warning messages happen very frequently in the KgoVerifierProducer.
The "Produced at unexpected offset" is generally a sign that a test is pointing multiple producers at the same topic: kgo-verifier expects to "own" the topic so that it can predict what offset messages will land at. This isn't necessarily harmful: it just means the consumer can't do such a thorough job of verifying that things landed where they should have done. Need to dig into the test and check what its intention is. |
https://buildkite.com/redpanda/vtools/builds/5842#018625c0-59f1-46fa-9b0f-adb6adbfb798
On "unexpected offsets" conditions, the producer will drop out of produceInner and re-enter, to destroy and create a client. However, dropping the reference to the client doesn't synchronously close it, so it's possible for the client to remain alive too long, and perhaps continue to drain its produce channel, disrupting the offset checks in the subsequent call to produceInner. It's not certain that we've seen this happen, but it is one possibility discussed in redpanda-data/redpanda#8289 that is straightforward to eliminate. Related: redpanda-data/redpanda#8289
Looking at build 5842. Only 2 partitions are affected (9 & 12). The test itself is not running more than one producer. The producer is seeing bad offsets several times and restarting each time. It produces for several minutes without issues before the first case of bad offsets. A leadership transfer happens about 3 seconds before the producer experiences a problem.
The consumer in this test is a random consumer, so it is not guaranteed to scan everything, but it did not find any invalid reads. Another odd observation in the producer's logs is that after its initial unexpected offsets, it is regularly getting "cannot assign requested address" errors:
One possible explanation for the recurring "unexpected offsets" (but not the first ones) would be if the producer is not shutting down the franz-go client promptly enough between calls to produceInner. It could still be trying to produce from the previous aborted run while the next run starts. We can tighten this up: redpanda-data/kgo-verifier#20
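As a rough illustration of that tightening, here is a sketch of flushing and closing the old franz-go client before the next iteration builds a new one, rather than just dropping the reference. The produceInner stub and the option list are placeholders, not kgo-verifier's real signatures.

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// produceInner is a stand-in for kgo-verifier's inner produce loop; its real
// signature may differ. Assume it returns a non-nil error when it sees an
// unexpected offset and wants the caller to rebuild the client.
func produceInner(ctx context.Context, cl *kgo.Client) error {
	// ... produce records and check the offsets they land at ...
	return nil
}

// produceLoop sketches the pattern described above: explicitly flush and close
// the old client before creating a new one, so the old client cannot keep
// draining its produce buffer while the next produceInner run is already
// checking offsets.
func produceLoop(ctx context.Context, opts ...kgo.Opt) error {
	for {
		cl, err := kgo.NewClient(opts...)
		if err != nil {
			return err
		}
		err = produceInner(ctx, cl)
		_ = cl.Flush(ctx) // flush whatever is still buffered
		cl.Close()        // shut the old client down before building a new one
		if err == nil {
			return nil
		}
		// otherwise loop and retry with a fresh client
	}
}

func main() {
	err := produceLoop(context.Background(),
		kgo.SeedBrokers("localhost:9092"),
		kgo.DefaultProduceTopic("verify-topic"))
	if err != nil {
		log.Fatal(err)
	}
}
```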
Looking at the OP build 5252 to see if it has any more clues:
This could be a real bug, although I don't have a great theory where. I don't think it is anything to do with the stated goal of the test (driving topic create+destroy ops under load). The two examples so far have had opposing values of the ... The inconsistency that the client is pointing out is between the HWM in a metadata request and the offset in the response to a subsequent produce request (a client-side sketch of this check follows below), so if it's a server bug I'd be looking in these places:
Since this is happening on clustered ducktape and not in dockerized tests, I would guess the data rate is significant. I would not be surprised if this is reproducible by running up a clustered ducktape environment and running this test on a loop.
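To make the HWM-vs-produce-offset inconsistency mentioned above concrete, here is a rough client-side sketch using franz-go's kgo and kadm packages. The broker address, topic name, and the exact admin calls are assumptions for illustration, not what kgo-verifier itself does.

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	ctx := context.Background()
	// Assumed broker address and topic.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.DefaultProduceTopic("verify-topic"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()
	adm := kadm.NewClient(cl)

	// 1. List the partition end offsets (the high watermark as the client sees it).
	ends, err := adm.ListEndOffsets(ctx, "verify-topic")
	if err != nil {
		log.Fatal(err)
	}

	// 2. Produce one record and see which offset the broker assigns to it.
	rec := &kgo.Record{Value: []byte("probe")}
	if err := cl.ProduceSync(ctx, rec).FirstErr(); err != nil {
		log.Fatal(err)
	}

	// 3. In a topic this client owns exclusively, the produced offset should not
	//    land below the end offset just listed; seeing that would be one concrete
	//    form of the metadata-vs-produce-response inconsistency.
	if end, ok := ends.Lookup("verify-topic", rec.Partition); ok && rec.Offset < end.Offset {
		log.Printf("inconsistency: produced at %d but listed HWM was already %d on partition %d",
			rec.Offset, end.Offset, rec.Partition)
	}
}
```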
Assigning to Andrew per weekly discussion.
Unassigning Andrew as he is helping with another higher-priority project.
TODO: check for overlaps in manifests if the test bundle has a cloud_diagnostics.zip file. If so, this could be fixed by #8810
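For reference, a hypothetical sketch of what such an overlap check could look like over a partition manifest's segment list; the Segment fields here are simplified stand-ins, not the real manifest schema.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is a simplified, hypothetical view of one entry in a partition
// manifest; the real manifest JSON has more fields and different names.
type Segment struct {
	BaseOffset      int64
	CommittedOffset int64 // inclusive last offset in the segment
}

// findOverlaps returns pairs of adjacent segments whose offset ranges overlap
// once the segments are sorted by base offset.
func findOverlaps(segs []Segment) [][2]Segment {
	sort.Slice(segs, func(i, j int) bool { return segs[i].BaseOffset < segs[j].BaseOffset })
	var overlaps [][2]Segment
	for i := 1; i < len(segs); i++ {
		if segs[i].BaseOffset <= segs[i-1].CommittedOffset {
			overlaps = append(overlaps, [2]Segment{segs[i-1], segs[i]})
		}
	}
	return overlaps
}

func main() {
	segs := []Segment{
		{BaseOffset: 0, CommittedOffset: 99},
		{BaseOffset: 100, CommittedOffset: 199},
		{BaseOffset: 150, CommittedOffset: 249}, // overlaps the previous segment
	}
	for _, p := range findOverlaps(segs) {
		fmt.Printf("overlap: [%d,%d] and [%d,%d]\n",
			p[0].BaseOffset, p[0].CommittedOffset, p[1].BaseOffset, p[1].CommittedOffset)
	}
}
```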
The failure does have a diagnostic bundle, but the failed partitions are not among the ones whose manifests we captured. The manifest dump limit was bumped in https://github.com/redpanda-data/redpanda/pull/8993/files (edit: I checked all three failures; none of them got lucky enough to have the manifest dump for a stuck partition).
In the last 30 days this has only happened in the two cases linked above. The most recent case was 15 days ago. It is plausible that #8810 fixed this, and we will not have any next steps unless it repeats with a more complete diagnostic dump, so closing this for now: we can re-open if it reoccurs with more debug info.
Also seen here on v23.1.2-rc2 CDT: https://buildkite.com/redpanda/vtools/builds/6709#0186e2a2-522c-4604-9b57-712bdf9a8773. Are any backports required?
Seen on v22.3.x here: #8289. This issue appears to only be happening on dedicated nodes, where we run with info-level logs.
"Produced at unexpected offset" showing up more reproducibly here #10200 -- may or may not be related to this issue. |
https://buildkite.com/redpanda/vtools/builds/5252#0185b99b-8c5e-4e0a-b084-b0ac2f2cebe0
There are some earlier closed issues for this test, but they do not match the error seen here.