KAFKA-8894: Bump streams test topic deletion assertion timeout from 30s to 60s #7330

stanislavkozlovski
Contributor

We have seen rare flakiness in this assertion: occasionally, not all of Streams' internal topics were deleted within the 30-second window. Increasing the timeout to 60 seconds should make the failure less frequent.

@stanislavkozlovski
Contributor Author

Related failures were:
https://issues.apache.org/jira/browse/KAFKA-8894
https://issues.apache.org/jira/browse/KAFKA-8895

I saw both fail in the same Jenkins run.

  Topic.GROUP_METADATA_TOPIC_NAME, intermediateUserTopic);
  } else {
- cluster.waitForRemainingTopics(30000, INPUT_TOPIC, OUTPUT_TOPIC, OUTPUT_TOPIC_2, OUTPUT_TOPIC_2_RERUN,
+ cluster.waitForRemainingTopics(60000, INPUT_TOPIC, OUTPUT_TOPIC, OUTPUT_TOPIC_2, OUTPUT_TOPIC_2_RERUN,
Member

I am always a little "concerned" about bumping timeouts if we don't understand why the test actually fails. 30 seconds seems like quite some time.

How long is deleting topics supposed to take? As far as I understand, we send a single request to the brokers via AdminClient to delete all internal topics. Is there a relationship between the expected time to delete all topics and the number of topics in the request?

cc @cmccabe

Contributor Author

As far as I'm aware, the deletion time depends on the number of partitions.
I don't believe it normally takes more than 30 seconds in practice, but I can imagine it happening when the Jenkins workers are overloaded. That's my intuition as to why it failed once.
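
For readers unfamiliar with this style of assertion, here is a minimal sketch of a poll-until-deadline wait. It is not the code behind `waitForRemainingTopics`: the class `TopicWait`, its method, and the `Supplier` that stands in for listing topics on the brokers are all hypothetical names chosen for illustration.

```java
import java.util.Set;
import java.util.function.Supplier;

public class TopicWait {
    /**
     * Polls listTopics until it returns exactly the expected set of topics,
     * or until timeoutMs has elapsed. Returns true on a match, false on
     * timeout or interruption. (Hypothetical helper, not Kafka's test API.)
     */
    public static boolean waitForTopics(Supplier<Set<String>> listTopics,
                                        Set<String> expected,
                                        long timeoutMs,
                                        long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (expected.equals(listTopics.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed without a match
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

With a wait shaped like this, raising the timeout only helps if the remaining topics would eventually match. If the condition can never become true, the test still fails, just 30 seconds later.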

Contributor

I looked at the test, and at least one of the failed tests, namely testReprocessingFromDateTimeAfterResetWithoutIntermediateUserTopic, does not create any internal topics and hence has none to delete. So it's not clear that bumping up the timeout would help here.

I'd suggest we first augment the error messages to include the expected topics and the actual topics.
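
As a rough illustration of that suggestion, the sketch below builds a failure message listing the expected topics, the actual topics, and the difference between them. The class `TopicDiffMessage` and its `describe` method are hypothetical and are not part of Kafka's test utilities.

```java
import java.util.Set;
import java.util.TreeSet;

public class TopicDiffMessage {
    /**
     * Builds a failure message that names the expected and actual topics,
     * plus which topics are missing and which are unexpected.
     * Sorted sets keep the output stable across runs.
     */
    public static String describe(Set<String> expected, Set<String> actual) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(actual);          // expected but not present
        Set<String> unexpected = new TreeSet<>(actual);
        unexpected.removeAll(expected);     // present but not expected
        return "Topics did not match within the timeout."
            + " expected=" + new TreeSet<>(expected)
            + ", actual=" + new TreeSet<>(actual)
            + ", missing=" + missing
            + ", unexpected=" + unexpected;
    }
}
```

A message like this would show at a glance whether a failure came from a leftover internal topic or from some other mismatch, which is the information needed to decide whether a longer timeout is the right fix.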

@mjsax mjsax added streams tests Test fixes (including flaky tests) labels Sep 13, 2019
@guozhangwang
Contributor

I filed another PR related to this ticket: #8208

@mjsax
Member

mjsax commented Dec 28, 2022

Closing this PR as the corresponding Jira ticket is marked "resolved".

@mjsax mjsax closed this Dec 28, 2022