CI Failure (on dedicated nodes): ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError #8289

Closed
abhijat opened this issue Jan 18, 2023 · 14 comments · Fixed by #10232
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working)

Comments

@abhijat (Contributor) commented Jan 18, 2023

Module: rptest.tests.e2e_shadow_indexing_test
Class:  ShadowIndexingWhileBusyTest
Method: test_create_or_delete_topics_while_busy
Arguments:
{
  "short_retention": true
}

https://buildkite.com/redpanda/vtools/builds/5252#0185b99b-8c5e-4e0a-b084-b0ac2f2cebe0

rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True: Summary: TimeoutError('Producer did not finish')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/e2e_shadow_indexing_test.py", line 526, in test_create_or_delete_topics_while_busy
    self.redpanda.wait_until(create_or_delete_until_producer_fin,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1069, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Producer did not finish
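
For context on the error: the test drives topic create/delete churn in a polling loop until a background kgo-verifier producer reports completion, and the TimeoutError above is that loop giving up. A minimal sketch of the polling pattern, using ducktape's `wait_until` directly; `make_churn_step` and `producer.is_complete()` are hypothetical stand-ins for what the real test body in `e2e_shadow_indexing_test.py` does:

```python
from ducktape.utils.util import wait_until


def wait_for_producer_while_churning(producer, make_churn_step, timeout_sec=600):
    """Churn topics until the background producer finishes, or time out."""

    def create_or_delete_until_producer_fin() -> bool:
        # One round of topic create/delete churn per poll...
        make_churn_step()
        # ...then report whether the producer has finished its workload.
        return producer.is_complete()  # hypothetical helper

    # ducktape raises TimeoutError(err_msg) when the condition never becomes
    # true within timeout_sec -- i.e. "Producer did not finish".
    wait_until(create_or_delete_until_producer_fin,
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg="Producer did not finish")
```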

There are some earlier closed issues for this test, but they do not match the error seen here.

abhijat added the kind/bug and ci-failure labels on Jan 18, 2023
jcsp added the area/cloud-storage label on Jan 20, 2023
@andijcr (Contributor) commented Jan 27, 2023

https://buildkite.com/redpanda/vtools/builds/5492#0185ed20-8fbc-492a-b902-3174d5125678

FAIL test: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.S3 (1/41 runs)
  failure at 2023-01-26T15:48:03.990Z: TimeoutError('Producer did not finish')
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/5492#0185ed20-8fbc-492a-b902-3174d5125678

@andrwng (Contributor) commented Jan 28, 2023

Just noting that one notable difference between the runs posted and a green run in CDT is the presence of warning messages like:

time="2023-01-26T09:41:37Z" level=warning msg="Produced at unexpected offset 216 (expected 211) on partition 17"

which appear very frequently in the KgoVerifierProducer's logs.

@jcsp (Contributor) commented Jan 31, 2023

The "Produced at unexpected offset" is generally a sign that a test is pointing multiple producers at the same topic: kgo-verifier expects to "own" the topic so that it can predict what offset messages will land at. This isn't necessarily harmful: it just means the consumer can't do such a thorough job of verifying that things landed where they should have done.

Need to dig into the test and check what its intention is.
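
For readers not familiar with the check: kgo-verifier is a Go tool, but the bookkeeping behind the warning is simple. A rough Python sketch, assuming one record per ack for simplicity, of how a second writer (or a stale client of our own) shows up as "Produced at unexpected offset":

```python
import logging


class OffsetPredictor:
    """Per-partition tracking of where our own produce acks should land,
    assuming we are the only writer to the topic (one record per ack,
    for simplicity)."""

    def __init__(self):
        self.expected = {}  # partition -> next offset we expect to be acked

    def on_ack(self, partition: int, acked_offset: int) -> None:
        expected = self.expected.get(partition, acked_offset)
        if acked_offset != expected:
            # Either another producer appended to the partition, or records
            # from a stale client of ours landed without being accounted for.
            logging.warning(
                "Produced at unexpected offset %d (expected %d) on partition %d",
                acked_offset, expected, partition)
        # Whatever actually happened, the next ack should land right after.
        self.expected[partition] = acked_offset + 1
```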

@dotnwat (Member) commented Feb 6, 2023

https://buildkite.com/redpanda/vtools/builds/5842#018625c0-59f1-46fa-9b0f-adb6adbfb798

FAIL test: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=False.cloud_storage_type=CloudStorageType.S3 (1/66 runs)
  failure at 2023-02-06T13:57:50.916Z: TimeoutError('Producer did not finish')
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/5842#018625c0-59f1-46fa-9b0f-adb6adbfb798

jcsp added a commit to redpanda-data/kgo-verifier that referenced this issue on Feb 7, 2023:
On "unexpected offsets" conditions, the producer
will drop out of produceInner and re-enter, to
destroy and create a client.

However, dropping the reference to the client doesn't
synchronously close it, so it's possible for the
client to remain alive too long, and perhaps continue
to drain its produce channel, disrupting the offset checks
in the subsequent call to produceInner.

It's not certain that we've seen this happen, but it is one
possibility discussed in redpanda-data/redpanda#8289
that is straightforward to eliminate.

Related: redpanda-data/redpanda#8289
@jcsp (Contributor) commented Feb 7, 2023

Looking at build 5842: only 2 partitions are affected (9 & 12).

The test itself is not running more than one producer. The producer is seeing bad offsets several times and restarting each time. It produces for several minutes without issues before the first case of bad offsets.

A leadership transfer happens about 3 seconds before the producer experiences a problem.

RedpandaService-0-140517044687232/ip-172-31-10-89/redpanda.log:INFO  2023-02-06 09:19:29,416 [shard 1] raft - [group_id:10, {kafka/topic-hibiugndrq/9}] vote_stm.cc:280 - became the leader term: 2
RedpandaService-0-140517044687232/ip-172-31-10-89/redpanda.log:INFO  2023-02-06 09:19:29,732 [shard 3] raft - [group_id:13, {kafka/topic-hibiugndrq/12}] vote_stm.cc:280 - became the leader term: 2

time="2023-02-06T09:19:32Z" level=warning msg="Produced at unexpected offset 2726 (expected 2702) on partition 9"
time="2023-02-06T09:19:32Z" level=warning msg="Produced at unexpected offset 2638 (expected 2616) on partition 12"

The consumer in this test is a random consumer, so not guaranteed to scan everything, but it did not find any invalid reads (or out_of_scope_invalid_reads) at all. If something had been blasting extra messages to the topic I would have expected some of those.

Another odd observation in the producer's logs is that after its initial unexpected offsets, it is regularly getting "cannot assign requested address" errors:

time="2023-02-06T09:24:40Z" level=info msg="Loading offsets for topic topic-hibiugndrq t=-1..."
(partial ListOffsets to broker ip-172-31-3-69:9092 (7) failure: unable to dial: dial tcp 172.31.3.69:9092: connect: cannot assign requested address)
  • It's tough to know what happened here, as the logs are only at INFO level.
  • It's very unlikely that this is the result of some rogue producer from another test, because the topic name in use is pseudorandom.

One possible explanation for the recurring "unexpected offsets" (but not the first ones) would be if the producer is not shutting down the franz-go client promptly enough between calls to produceInner. It could still be trying to produce from the previous aborted run while the next run starts. We can tighten this up: redpanda-data/kgo-verifier#20
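
kgo-verifier itself is Go (franz-go), but the shape of the suspected race is language-agnostic: if the old client is merely dropped rather than shut down synchronously, its background sender can keep draining buffered records while the next produceInner run is already predicting offsets. A toy Python sketch of that pattern and of why an explicit, blocking close removes it (the Producer class here is illustrative, not the real kgo-verifier code):

```python
import queue
import threading


class Producer:
    """Toy producer: a background thread drains a buffer of records."""

    def __init__(self, send):
        self._buffer = queue.Queue()
        self._stop = threading.Event()
        self._sender = threading.Thread(target=self._drain, args=(send,),
                                        daemon=True)
        self._sender.start()

    def produce(self, record):
        self._buffer.put(record)

    def _drain(self, send):
        while not self._stop.is_set():
            try:
                send(self._buffer.get(timeout=0.1))
            except queue.Empty:
                continue

    def close(self):
        # Blocking close: after this returns, no more records from this
        # instance can reach the broker, so a freshly created producer's
        # offset predictions cannot be disturbed by leftovers.
        self._stop.set()
        self._sender.join()


# Merely rebinding the name ("producer = Producer(send)") leaves the old
# sender thread alive and still draining; calling producer.close() first
# makes the restart synchronous, analogous to the tightening the linked
# kgo-verifier change aims to guarantee.
```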

jcsp mentioned this issue on Feb 7, 2023
@jcsp (Contributor) commented Feb 7, 2023

Looking at the OP build 5252 to see if it has any more clues:

  • In this case only a single partition is affected
  • The first "unexpected offset" is about 2:20 into the producer's lifetime
  • As in the other build, the first unexpected offset happens ~3 seconds after a leadership transfer of the affected partition.

This could be a real bug, although I don't have a great theory where. I don't think it is anything to do with the stated goal of the test (driving topic create+destroy ops under load). The two examples so far have had opposing values of the short_retention parameter, so it's relatively unlikely to be related to cloud retention (since in the "short_retention=False" case, there is no cloud retention kicking in, only short local retention).

The inconsistency that the client is pointing out is between the hwm in a metadata request and the offset in the response to a subsequent produce request, so if it's a server bug I'd be looking in these places (see the sketch after the list):

  • a produce response is sending bad offsets (too high), e.g. due to an offset translation bug.
  • a metadata response is sending bad hwms (too low)
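
Concretely, the invariant the client asserts is just "my batch should start at the offset I predicted, seeded from the HWM I loaded"; from the client side a violation looks identical whichever of the two server-side explanations is true. A small illustrative check, using the partition 9 numbers from build 5842 (field names are ours, not franz-go's):

```python
def ack_matches_prediction(expected_next_offset: int,
                           acked_base_offset: int) -> bool:
    """Sole-producer invariant: the base offset in the produce response
    should equal the offset we predicted (HWM at load time, advanced by
    every record acked since)."""
    return acked_base_offset == expected_next_offset


# Partition 9 in build 5842: prediction 2702, broker acked at 2726.
# The client can only flag the mismatch; it cannot tell whether the
# produce response was too high or the metadata/ListOffsets HWM too low.
assert not ack_matches_prediction(expected_next_offset=2702,
                                  acked_base_offset=2726)
```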

Since this is happening on clustered ducktape and not in dockerized tests, I would guess the data rate is significant. I would not be surprised if this is reproducible by running up a clustered ducktape environment and running this test on a loop.

@piyushredpanda (Contributor) commented:

Assigning to Andrew per weekly discussion

@piyushredpanda (Contributor) commented:

Unassigning Andrew as he is helping with another higher priority project.

@jcsp (Contributor) commented Feb 17, 2023

TODO: check for overlaps in manifests if the test bundle has a cloud_diagnostics.zip file. If so, this could be fixed by #8810.

@jcsp (Contributor) commented Feb 21, 2023

The failure does have a diagnostic bundle, but the failed partitions are not among the ones whose manifests we captured. The manifest dump limit was bumped in https://github.com/redpanda-data/redpanda/pull/8993/files

(edit: I checked all three failures; none of them got lucky enough to have a manifest dump for a stuck partition)

@jcsp (Contributor) commented Feb 21, 2023

In the last 30 days this has only happened in the two cases linked above. The most recent case was 15 days ago. It is plausible that #8810 fixed this, and we will not have any next steps unless it repeats with a more complete diagnostic dump, so closing this for now: we can re-open if it reoccurs with more debug info.

jcsp closed this as completed on Feb 21, 2023
@BenPope (Member) commented Mar 15, 2023

Also seen here on v23.1.2-rc2 CDT: https://buildkite.com/redpanda/vtools/builds/6709#0186e2a2-522c-4604-9b57-712bdf9a8773
Are any backports required?

BenPope reopened this on Mar 15, 2023
jcsp changed the title from "CI Failure: ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError" to "CI Failure (on dedicated nodes): ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True TimeoutError" on Mar 23, 2023
@jcsp (Contributor) commented Mar 23, 2023

Seen on v22.3.x here: #8289

This issue appears to only be happening on dedicated nodes, where we run with info-level logs.

@jcsp (Contributor) commented Apr 20, 2023

"Produced at unexpected offset" showing up more reproducibly here #10200 -- may or may not be related to this issue.
