CI Failure (on dedicated nodes): ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError #8289

Closed
abhijat opened this issue Jan 18, 2023 · 14 comments · Fixed by #10232
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working)

Comments

@abhijat (Contributor) commented Jan 18, 2023

Module: rptest.tests.e2e_shadow_indexing_test
Class:  ShadowIndexingWhileBusyTest
Method: test_create_or_delete_topics_while_busy
Arguments:
{
  "short_retention": true
}

https://buildkite.com/redpanda/vtools/builds/5252#0185b99b-8c5e-4e0a-b084-b0ac2f2cebe0

rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True: Summary: TimeoutError('Producer did not finish')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/e2e_shadow_indexing_test.py", line 526, in test_create_or_delete_topics_while_busy
    self.redpanda.wait_until(create_or_delete_until_producer_fin,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1069, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Producer did not finish
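
For context on the error: the test drives topic create/delete churn in a polling loop until a background kgo-verifier producer reports completion, and the TimeoutError above is that loop giving up. A minimal sketch of the polling pattern, using ducktape's `wait_until` directly; `make_churn_step` and `producer.is_complete()` are hypothetical stand-ins for what the real test body in `e2e_shadow_indexing_test.py` does:

```python
from ducktape.utils.util import wait_until


def wait_for_producer_while_churning(producer, make_churn_step, timeout_sec=600):
    """Churn topics until the background producer finishes, or time out."""

    def create_or_delete_until_producer_fin() -> bool:
        # One round of topic create/delete churn per poll...
        make_churn_step()
        # ...then report whether the producer has finished its workload.
        return producer.is_complete()  # hypothetical helper

    # ducktape raises TimeoutError(err_msg) when the condition never becomes
    # true within timeout_sec -- i.e. "Producer did not finish".
    wait_until(create_or_delete_until_producer_fin,
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg="Producer did not finish")
```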

There are some earlier closed issues for this test, but they do not match the error seen here.

abhijat added the kind/bug and ci-failure labels on Jan 18, 2023
jcsp added the area/cloud-storage label on Jan 20, 2023
@andijcr (Contributor) commented Jan 27, 2023

https://buildkite.com/redpanda/vtools/builds/5492#0185ed20-8fbc-492a-b902-3174d5125678

FAIL test: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.S3 (1/41 runs)
  failure at 2023-01-26T15:48:03.990Z: TimeoutError('Producer did not finish')
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/5492#0185ed20-8fbc-492a-b902-3174d5125678

@andrwng (Contributor) commented Jan 28, 2023

Just noting that one notable difference between the runs posted and a green run in CDT is the presence of warning messages like:

time="2023-01-26T09:41:37Z" level=warning msg="Produced at unexpected offset 216 (expected 211) on partition 17"

which appear very frequently in the KgoVerifierProducer's logs.

@jcsp (Contributor) commented Jan 31, 2023

The "Produced at unexpected offset" is generally a sign that a test is pointing multiple producers at the same topic: kgo-verifier expects to "own" the topic so that it can predict what offset messages will land at. This isn't necessarily harmful: it just means the consumer can't do such a thorough job of verifying that things landed where they should have done.

Need to dig into the test and check what its intention is.
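
For readers not familiar with the check: kgo-verifier is a Go tool, but the bookkeeping behind the warning is simple. A rough Python sketch, assuming one record per ack for simplicity, of how a second writer (or a stale client of our own) shows up as "Produced at unexpected offset":

```python
import logging


class OffsetPredictor:
    """Per-partition tracking of where our own produce acks should land,
    assuming we are the only writer to the topic (one record per ack,
    for simplicity)."""

    def __init__(self):
        self.expected = {}  # partition -> next offset we expect to be acked

    def on_ack(self, partition: int, acked_offset: int) -> None:
        expected = self.expected.get(partition, acked_offset)
        if acked_offset != expected:
            # Either another producer appended to the partition, or records
            # from a stale client of ours landed without being accounted for.
            logging.warning(
                "Produced at unexpected offset %d (expected %d) on partition %d",
                acked_offset, expected, partition)
        # Whatever actually happened, the next ack should land right after.
        self.expected[partition] = acked_offset + 1
```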

@dotnwat (Member) commented Feb 6, 2023

https://buildkite.com/redpanda/vtools/builds/5842#018625c0-59f1-46fa-9b0f-adb6adbfb798

FAIL test: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=False.cloud_storage_type=CloudStorageType.S3 (1/66 runs)
  failure at 2023-02-06T13:57:50.916Z: TimeoutError('Producer did not finish')
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/5842#018625c0-59f1-46fa-9b0f-adb6adbfb798

jcsp added a commit to redpanda-data/kgo-verifier that referenced this issue on Feb 7, 2023:
On "unexpected offsets" conditions, the producer
will drop out of produceInner and re-enter, to
destroy and create a client.

However, dropping the reference to the client doesn't
synchronously close it, so it's possible for the
client to remain alive too long, and perhaps continue
to drain its produce channel, disrupting the offset checks
in the subsequent call to produceInner.

It's not certain that we've seen this happen, but it is one
possibility discussed in redpanda-data/redpanda#8289
that is straightforward to eliminate.

Related: redpanda-data/redpanda#8289
@jcsp (Contributor) commented Feb 7, 2023

Looking at build 5842: only 2 partitions are affected (9 & 12).

The test itself is not running more than one producer. The producer is seeing bad offsets several times and restarting each time. It produces for several minutes without issues before the first case of bad offsets.

A leadership transfer happens about 3 seconds before the producer experiences a problem.

RedpandaService-0-140517044687232/ip-172-31-10-89/redpanda.log:INFO  2023-02-06 09:19:29,416 [shard 1] raft - [group_id:10, {kafka/topic-hibiugndrq/9}] vote_stm.cc:280 - became the leader term: 2
RedpandaService-0-140517044687232/ip-172-31-10-89/redpanda.log:INFO  2023-02-06 09:19:29,732 [shard 3] raft - [group_id:13, {kafka/topic-hibiugndrq/12}] vote_stm.cc:280 - became the leader term: 2

time="2023-02-06T09:19:32Z" level=warning msg="Produced at unexpected offset 2726 (expected 2702) on partition 9"
time="2023-02-06T09:19:32Z" level=warning msg="Produced at unexpected offset 2638 (expected 2616) on partition 12"

The consumer in this test is a random consumer, so not guaranteed to scan everything, but it did not find any invalid reads (or out_of_scope_invalid_reads) at all. If something had been blasting extra messages to the topic I would have expected some of those.

Another odd observation in the producer's logs is that after its initial unexpected offsets, it is regularly getting "cannot assign requested address" errors:

time="2023-02-06T09:24:40Z" level=info msg="Loading offsets for topic topic-hibiugndrq t=-1..."
(partial ListOffsets to broker ip-172-31-3-69:9092 (7) failure: unable to dial: dial tcp 172.31.3.69:9092: connect: cannot assign requested address)
  • It's tough to know what happened here, as the logs are only at INFO level.
  • It's very unlikely that this is the result of some rogue producer from another test, because the topic name in use is pseudorandom.

One possible explanation for the recurring "unexpected offsets" (but not the first ones) would be if the producer is not shutting down the franz-go client promptly enough between calls to produceInner. It could still be trying to produce from the previous aborted run while the next run starts. We can tighten this up: redpanda-data/kgo-verifier#20
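
kgo-verifier itself is Go (franz-go), but the shape of the suspected race is language-agnostic: if the old client is merely dropped rather than shut down synchronously, its background sender can keep draining buffered records while the next produceInner run is already predicting offsets. A toy Python sketch of that pattern and of why an explicit, blocking close removes it (the Producer class here is illustrative, not the real kgo-verifier code):

```python
import queue
import threading


class Producer:
    """Toy producer: a background thread drains a buffer of records."""

    def __init__(self, send):
        self._buffer = queue.Queue()
        self._stop = threading.Event()
        self._sender = threading.Thread(target=self._drain, args=(send,),
                                        daemon=True)
        self._sender.start()

    def produce(self, record):
        self._buffer.put(record)

    def _drain(self, send):
        while not self._stop.is_set():
            try:
                send(self._buffer.get(timeout=0.1))
            except queue.Empty:
                continue

    def close(self):
        # Blocking close: after this returns, no more records from this
        # instance can reach the broker, so a freshly created producer's
        # offset predictions cannot be disturbed by leftovers.
        self._stop.set()
        self._sender.join()


# Merely rebinding the name ("producer = Producer(send)") leaves the old
# sender thread alive and still draining; calling producer.close() first
# makes the restart synchronous, analogous to the tightening the linked
# kgo-verifier change aims to guarantee.
```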

jcsp mentioned this issue on Feb 7, 2023
@jcsp (Contributor) commented Feb 7, 2023

Looking at the OP build 5252 to see if it has any more clues:

  • In this case only a single partition is affected
  • The first "unexpected offset" is about 2:20 into the producer's lifetime
  • As in the other build, the first unexpected offset happens ~3 seconds after a leadership transfer of the affected partition.

This could be a real bug, although I don't have a great theory where. I don't think it is anything to do with the stated goal of the test (driving topic create+destroy ops under load). The two examples so far have had opposing values of the short_retention parameter, so it's relatively unlikely to be related to cloud retention (since in the "short_retention=False" case, there is no cloud retention kicking in, only short local retention).

The inconsistency that the client is pointing out is between the hwm in a metadata request and the offset in the response to a subsequent produce request, so if it's a server bug I'd be looking in these places (see the sketch after the list):

  • a produce response is sending bad offsets (too high), e.g. due to an offset translation bug.
  • a metadata response is sending bad hwms (too low)
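
Concretely, the invariant the client asserts is just "my batch should start at the offset I predicted, seeded from the HWM I loaded"; from the client side a violation looks identical whichever of the two server-side explanations is true. A small illustrative check, using the partition 9 numbers from build 5842 (field names are ours, not franz-go's):

```python
def ack_matches_prediction(expected_next_offset: int,
                           acked_base_offset: int) -> bool:
    """Sole-producer invariant: the base offset in the produce response
    should equal the offset we predicted (HWM at load time, advanced by
    every record acked since)."""
    return acked_base_offset == expected_next_offset


# Partition 9 in build 5842: prediction 2702, broker acked at 2726.
# The client can only flag the mismatch; it cannot tell whether the
# produce response was too high or the metadata/ListOffsets HWM too low.
assert not ack_matches_prediction(expected_next_offset=2702,
                                  acked_base_offset=2726)
```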

Since this is happening on clustered ducktape and not in dockerized tests, I would guess the data rate is significant. I would not be surprised if this is reproducible by running up a clustered ducktape environment and running this test on a loop.

@piyushredpanda (Contributor) commented:

Assigning to Andrew per weekly discussion

@piyushredpanda (Contributor) commented:

Unassigning Andrew as he is helping with another higher priority project.

@jcsp (Contributor) commented Feb 17, 2023

TODO: check for overlaps in manifests if the test bundle has a cloud_diagnostics.zip file. If so, this could be fixed by #8810.

@jcsp (Contributor) commented Feb 21, 2023

The failure does have a diagnostic bundle, but the failed partitions are not among the ones whose manifests we captured. The manifest dump limit was bumped in https://github.com/redpanda-data/redpanda/pull/8993/files

(edit: I checked all three failures; none of them got lucky enough to have a manifest dump for a stuck partition)

@jcsp (Contributor) commented Feb 21, 2023

In the last 30 days this has only happened in the two cases linked above. The most recent case was 15 days ago. It is plausible that #8810 fixed this, and we will not have any next steps unless it repeats with a more complete diagnostic dump, so closing this for now: we can re-open if it reoccurs with more debug info.

jcsp closed this as completed on Feb 21, 2023
@BenPope (Member) commented Mar 15, 2023

Also seen here on v23.1.2-rc2 CDT: https://buildkite.com/redpanda/vtools/builds/6709#0186e2a2-522c-4604-9b57-712bdf9a8773
Are any backports required?

BenPope reopened this on Mar 15, 2023
jcsp changed the title from "CI Failure: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError" to "CI Failure (on dedicated nodes): ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError" on Mar 23, 2023
@jcsp (Contributor) commented Mar 23, 2023

Seen on v22.3.x here: #8289

This issue appears to only be happening on dedicated nodes, where we run with info-level logs.

@jcsp (Contributor) commented Apr 20, 2023

"Produced at unexpected offset" showing up more reproducibly here #10200 -- may or may not be related to this issue.
