[Bug] Client with shared subscription is blocked #21104
@michalcukierman Could you please share the topic stats and internal-stats when the issue is reproduced?
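The topic stats and internal-stats can be taken with pulsar-admin topics stats / stats-internal, or programmatically. A minimal sketch using the Pulsar admin Java client; the admin URL, topic name, and printed fields below are illustrative placeholders, not values from this report:

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.SubscriptionStats;
import org.apache.pulsar.common.policies.data.TopicStats;

public class DumpStats {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/source-topic"; // placeholder
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")           // placeholder
                .build()) {
            TopicStats stats = admin.topics().getStats(topic);
            for (var e : stats.getSubscriptions().entrySet()) {
                SubscriptionStats sub = e.getValue();
                System.out.println(e.getKey()
                        + " backlog=" + sub.getMsgBacklog()
                        + " unacked=" + sub.getUnackedMessages()
                        + " blockedOnUnacked=" + sub.isBlockedSubscriptionOnUnackedMsgs());
            }
            // internal-stats: per-ledger/cursor details of the same topic
            System.out.println(admin.topics().getInternalStats(topic));
        }
    }
}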
Two files attached. I see in the stats that there are unacked messages, but I do ack all of the messages:

Consumer<byte[]> consumer = client.newConsumer()
.topic(sourceTopic)
.subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
.subscriptionName(subscriptionName)
.subscriptionType(SubscriptionType.Shared)
.receiverQueueSize(8)
.ackTimeout(ackTimeout, TimeUnit.SECONDS)
.subscribe();
Producer<byte[]> producer = client.newProducer(Schema.BYTES)
.topic(destinationTopic)
.compressionType(CompressionType.LZ4)
.maxPendingMessages(8) // Support 5 MB files
.blockIfQueueFull(true) // Support 5 MB files
.batchingMaxBytes(5242880)
.create();
Multi<Message<byte[]>> messages = Multi.createBy().repeating()
.completionStage(consumer::receiveAsync)
.until(m -> closed.get());
messages.subscribe().with(msg -> {
receivedDistribution.record(getKiloBytesSize(msg.getData()));
Uni.createFrom().completionStage(producer.newMessage(Schema.BYTES)
.key(msg.getKey())
.value(msg.getValue().getContent().getPayload().toByteArray())
.eventTime(msg.getEventTime()).sendAsync())
.subscribe().with(msgId -> {
sentCounter.increment();
try {
consumer.acknowledge(msg.getMessageId());
} catch (PulsarClientException e) {
throw new RuntimeException("Unable to send or ACK the message", e);
}
});
});
}

ackTimeout is set to 120 seconds. It can be confirmed in the log files, where sentCounter is just one message behind.
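For comparison, a minimal non-reactive sketch of the same receive -> produce -> ack flow. It assumes the consumer, producer, and closed flag from the snippet above, forwards the raw message bytes instead of the unwrapped payload, and is only an illustration, not the code that was failing:

import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.pulsar.client.api.*;

class CopyLoop {
    static void run(Consumer<byte[]> consumer, Producer<byte[]> producer, AtomicBoolean closed)
            throws PulsarClientException {
        while (!closed.get()) {
            Message<byte[]> msg = consumer.receive();
            producer.newMessage(Schema.BYTES)
                    .key(msg.getKey())
                    .value(msg.getData())
                    .eventTime(msg.getEventTime())
                    .sendAsync()
                    // ack only after the write is confirmed by the producer
                    .thenAccept(id -> consumer.acknowledgeAsync(msg.getMessageId()))
                    .exceptionally(e -> {
                        // ask for redelivery instead of waiting for the ack timeout
                        consumer.negativeAcknowledge(msg.getMessageId());
                        return null;
                    });
        }
    }
}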
Here are the alternative stats:
I think I know why it happens: there may be a race condition in the 3.1.0 client, as the situation was not observed with 2.10.4 (we've downgraded; @MichalKoziorowski-TomTom also reported this as a fix).
We don't have ackTimeout and it still happens. In my case, my topic has only one partition.
You said 1 million messages, right? Could you post your sample payload? I want to reproduce and debug it.
It's 1 million copies of a ~30 kB sample HTML file. Any text file would be good.

Here is the exact payload:
+1, it seems to also occur in other subscription modes (Shared, Key_Shared).
It happened again today on a different module. The topic is not partitioned and has no retention. The solution is to use the 2.10.4 client for now.
I'll try to reproduce it and get back to you. :)
Hi @michalcukierman, I created a repo to reproduce this issue, but no luck.
@mattisonchao, how about a fix deadline?
I've tried to create an isolated environment to reproduce the issue on a local machine using test containers, but I was not able to today.
It also happens randomly with 1 million messages: it sometimes gets stuck around 300k-500k messages, but it is not deterministic. I'll try to reproduce the issue on GCP, but I'll need more time. It may be that more than one partition on more than one broker is required. Unfortunately, with the test containers I am not able to recreate the same load.
It looks like I was able to reproduce the issue in the two runs today (failed 2/2).
The code is here: In general it's very much like in the bug description. Produce 1 million messages of 30 kB each:

@Outgoing("requests-out")
public Multi<String> produce() {
return Multi.createBy().repeating()
.uni(() -> Uni
.createFrom()
.item(() -> RandomStringUtils.randomAlphabetic(30_000))
.onItem()
.invoke(() -> System.out.println("+ Produced: " + outCount.incrementAndGet()))
)
.atMost(1_000_000);
}

Read it using a client with a shared subscription and write it to another topic:

@ApplicationScoped
public class Processor {
private final AtomicLong inCount = new AtomicLong(0);
@Incoming("requests-in")
@Outgoing("dump-out")
@Blocking
PulsarOutgoingMessage<String> process(PulsarIncomingMessage<String> in) {
System.out.println(" - Processed: " + inCount.incrementAndGet());
return PulsarOutgoingMessage.from(in);
}
}

The settings of the client are:
The retention of the topic:
I left a comment here; you can answer it under the current issue. Thanks.
The two issues may not be related. In both cases the subscriptions are blocked, but in this case restarting the broker didn't help - it looks like a deadlock in the client. @poorbarcode, have you tried the repository I've linked? It's possible to reproduce it on a GCP cluster; it should work on other clusters as well.
It might be fixed with #22352. Need to check...
@michalcukierman Could you recheck your case? I've checked mine and I can't reproduce it with the 3.0.4 client, while it was easily reproducible with 3.0.1. Test with a version that includes the #22352 fix.
I could not reproduce the issue with client 3.0.4, but I can still reproduce it with client 3.2.2.
I've noticed that with client 3.2.2 the behavior may be a bit different: the consumers get blocked but occasionally resume, and after receiving a couple of messages they are stuck again. I've confirmed once again that the issue does not occur with Pulsar client 3.0.4 (or at least I was not able to reproduce it after processing 1 million messages; with client 3.2.2 the clients are usually blocked after about 50k messages).
@poorbarcode, have you had a chance to see the last comment?
I checked the stats and the internal-stats. They show that all the messages are either unacked or in the backlog. I guess the limit on unacked messages kicked in: there are too many unacked messages on the consumer, so the broker stopped dispatching messages to the clients. Please ack each message after it has been processed successfully.
The whole source code, with instructions on how to run it, is available in the linked repository. The message acknowledgment should happen after getting the write confirmation from the producer. This is the original code that was failing; we are no longer using it because we've decommissioned the module. #21104 was created as a reproducible example.
This code creates a pipeline with source, processor, and sink stages. The sink ACK triggers the source ACK.
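Conceptually, that chaining can be written by hand with the plain MicroProfile Reactive Messaging API: the outgoing message reuses the incoming message's ack supplier, so the source message is acknowledged only once the sink has acknowledged the produced one. A rough sketch of the idea (not the project's actual code, which uses PulsarOutgoingMessage.from(in)):

import org.eclipse.microprofile.reactive.messaging.Incoming;
import org.eclipse.microprofile.reactive.messaging.Message;
import org.eclipse.microprofile.reactive.messaging.Outgoing;

public class AckChainingSketch {
    @Incoming("requests-in")
    @Outgoing("dump-out")
    Message<String> process(Message<String> in) {
        // Acknowledging the returned message calls in.ack(), so the sink ACK
        // propagates back to the source consumer.
        return Message.of(in.getPayload(), in::ack);
    }
}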
@michalcukierman I'm not familiar with this kind of development framework; #21104 (comment) is based on the stats and internal-stats you provided. The point is: none of the messages dispatched to the clients are acked. You can debug your code to confirm whether the messages are acked or not.
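One client-side way to confirm this is a consumer interceptor that logs every acknowledgment and every ack-timeout redelivery; a sketch (attach it with .intercept(new AckLoggingInterceptor<>()) on the consumer builder; System.out stands in for real logging):

import java.util.Set;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.ConsumerInterceptor;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;

public class AckLoggingInterceptor<T> implements ConsumerInterceptor<T> {
    @Override public Message<T> beforeConsume(Consumer<T> consumer, Message<T> message) {
        return message; // pass-through; nothing is changed before delivery
    }
    @Override public void onAcknowledge(Consumer<T> consumer, MessageId id, Throwable cause) {
        System.out.println("ack " + id + (cause != null ? " failed: " + cause : ""));
    }
    @Override public void onAcknowledgeCumulative(Consumer<T> consumer, MessageId id, Throwable cause) {
        System.out.println("cumulative ack " + id);
    }
    @Override public void onNegativeAcksSend(Consumer<T> consumer, Set<MessageId> ids) {
        System.out.println("negative acks sent for " + ids.size() + " messages");
    }
    @Override public void onAckTimeoutSend(Consumer<T> consumer, Set<MessageId> ids) {
        System.out.println("ack timeout, redelivery requested for " + ids.size() + " messages");
    }
    @Override public void close() { }
}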
Search before asking
Version
Client - 3.1.0
Pulsar - 3.1.0 (and later builds)
Also reported on 3.0.1
Minimal reproduce step
My reproducible steps:
What did you expect to see?
All messages are received
What did you see instead?
The client stops receiving messages; restarting the client helps, but it gets stuck again after some time.
Anything else?
The issue was originally described here: #21082
@MichalKoziorowski-TomTom also faces the issue.
I've created a new issue because in #21082 the author says that a broker restart helps. In this case, it looks like it's client-related: a race condition observed in 3.x.x after introducing ackTimeout.
Are you willing to submit a PR?