Pub/Sub streamingPull subscriber: large number of duplicate messages, modifyAckDeadline calls observed #2465
@kir-titievsky Do you have a workload that can reliably reproduce this? If you do, could you try removing the flow control (hopefully the workload isn't so large that it crashes your machine)? If the problem goes away, this is probably a dup of #2452; the symptoms are nearly identical. If this doesn't fix the problem, could you share the repro workload with me? EDIT: If the workload is too large to remove flow control, the fix for the linked issue is already in master, so we can test with that version. Slightly less convenient, as we'll need to compile from source.
The behaviour observed with #2452 was seen with the older non-streaming pull implementation (0.21.1-beta). This issue came about when using the newer client library. It might also be worth noting that with flow control set to 1000 max messages on the older library, duplicates were not seen, just the stuck messages.
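(For reference, a minimal sketch of how flow control is typically configured on the streaming subscriber in the Java client; the project/subscription names are placeholders, the 1000 limit just mirrors the number mentioned above, and builder overloads vary a bit across client versions.)

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;

public class FlowControlSketch {
  static Subscriber build(MessageReceiver receiver) {
    // Allow at most 1000 messages outstanding to user code at any one time;
    // "removing flow control" amounts to not calling setFlowControlSettings at all.
    FlowControlSettings flowControl =
        FlowControlSettings.newBuilder()
            .setMaxOutstandingElementCount(1000L)
            .build();

    // "my-project" and "my-subscription" are placeholder names.
    return Subscriber.newBuilder(
            "projects/my-project/subscriptions/my-subscription", receiver)
        .setFlowControlSettings(flowControl)
        .build();
  }
}
```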
@robertsaxby That makes sense. I can reproduce the messages getting stuck, but not redelivery. @kir-titievsky I need a little help understanding the spreadsheet. Are all rows for the same message? FWIW, I have a PR opened to address a potential race condition in streaming. It's conceivable that the race condition causes this problem. If you could set up a reproduction, please let me know.
@pongad You are right on all counts about the spreadsheet. All rows are for the same message, including traffic from several bidi streams that had been opened at different times.
@pongad Did a couple of experiments:
To summarize: this is an issue on the server side. Currently the client library does not have enough information to properly handle ack deadlines. In the immediate term, consider using …
The Pub/Sub team is working on a server-side mitigation for this. The client lib will need to be updated to take advantage of it. Fortunately, this new feature "piggybacks" on an already existing one, so the work on the client lib can progress right away. I hope to create a PR for this soon.
Update: the server-side release should happen this week. The feature should be enabled next week. The client (in master) has already been modified to take advantage of this new feature. When the server feature is enabled, we'll test to see how much this helps.
The server-side fix has landed. If you are affected, could you try again and see if you observe fewer duplicates? While we expect the fix to help reduce duplication on older client libs, I'd encourage moving to the latest release (v0.30.0), since more fixes have landed during that time.
We are seeing this behavior currently. We are using v0.30.0-beta of the pubsub library, and our subscriptions are all set to a 60s ack deadline. We have a subscription that is currently extending the deadline for over 750K unacked messages. Screenshot of Stackdriver below: https://screencast.com/t/KogMzx0q7f5 Our receiver always performs either ack() or nack(), and the occasional message that sneaks through when symptoms look like this completes within 1s, usually faster. I deleted the subscription and recreated it, and saw the modack calls drop to zero, only to climb back almost instantly to where they were before. Is there anything else that will help you/us troubleshoot this issue?
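(For context, a sketch of the ack-or-nack receiver pattern described above; the class name and the handle() helper are hypothetical stand-ins, not the reporter's actual code.)

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.pubsub.v1.PubsubMessage;

// Hypothetical receiver that always acks or nacks every message it is handed.
public class AckOrNackReceiver implements MessageReceiver {
  @Override
  public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
    try {
      handle(message);   // hypothetical application logic (~1s in the scenario above)
      consumer.ack();    // success: the message is removed from the subscription
    } catch (Exception e) {
      consumer.nack();   // failure: the message becomes eligible for redelivery
    }
  }

  private void handle(PubsubMessage message) {
    // process message.getData() here
  }
}
```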
Eric, how does the mod ack rate compare to your Publish and ack rates?
--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub
This is the ack rate during the same window: https://screencast.com/t/FR2utgEau The publish rate: https://screencast.com/t/2bT4ZcMkJwG We have a backlog of 1MM messages. Also, we have three separate identical environments with roughly the same usage, and we're seeing this issue on two of them. A third environment seems unaffected.
Should have asked this earlier, but are your streamingPull operations < 10 messages/second?
--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub
<https://cloud.google.com/pubsub/overview>
@kir-titievsky https://screencast.com/t/Nrlxws3v (the drop at the end is when I deleted and recreated the topic)
Looks like you are not acking most messages, which would leave them stuck being modAcked, no?
--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub
<https://cloud.google.com/pubsub/overview>
We're either acking or nacking every message. My assumption is that nacking would cause the message to become available for immediate redelivery.
Right. It looks like you are nacking most messages. The question is why.
--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub
<https://cloud.google.com/pubsub/overview>
I've been reading more this morning... and I realized that a nack() is really just a modifyAckDeadline call with a deadline of 0. FWIW, when we're seeing the hundreds of thousands of modacks (the screenshot I posted above), our code is effectively idle; our receivers aren't doing any work. Other than that, is there any way to see the pubsub client's automatic deadline-extension calls separately from our explicit acks/nacks?
It's a little annoying, but we log to a logger named "com.google.cloud.pubsub.v1.MessageDispatcher". If you turn that logger up to a fine-grained level, you should see more detail about what the dispatcher is doing with acks and deadline extensions. If your code is idle, this sounds like a serious problem. If there are messages pending, they should be scheduled immediately. I have a few guesses.
To rule out (3), could you try taking a thread dump (e.g. with jstack)? To help diagnose (2), could you share with us how you're creating the Subscriber?
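(A sketch of one way to turn that logger up, assuming the client logs through java.util.logging; the FINEST level is an assumption about what captures the dispatcher's most detailed output.)

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PubsubDispatcherLogging {
  // Keep a strong reference so the logger's configuration isn't garbage collected.
  private static final Logger DISPATCHER_LOGGER =
      Logger.getLogger("com.google.cloud.pubsub.v1.MessageDispatcher");

  public static void enable() {
    // FINEST is an assumption: it should capture everything the dispatcher emits.
    DISPATCHER_LOGGER.setLevel(Level.FINEST);

    ConsoleHandler handler = new ConsoleHandler();
    handler.setLevel(Level.FINEST);
    DISPATCHER_LOGGER.addHandler(handler);
  }
}
```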
@pongad - I've been trying to prep more intel, but I'll throw out a few more things that might be useful:
For example, there's a single thread for a single subscriber across the entire cluster that's currently "working":
Our logs show no task completion (we log whenever we ack/nack), and Stackdriver shows ~90 modack op/s and ~45 pull op/s. The TRACE logs look like this (I've removed all the …):
@smartytime Thank you for the info! From this, I believe the problem is not with the client lib. Reasoning below: the stack dump of your thread shows that it's waiting for things to do. The executor is definitely not deadlocked. However, our logs seem to disagree with what you said.
From the … [1] The "receipts" feature was added to help with the message duplication problem. The client lib sends modacks to let the server know it has received the messages. This might explain why we're seeing a lot of modacks. Unfortunately, the receipts are confusing the metric. Lastly, I think …
@pongad - Much thanks. I've been trying to read through the pubsub code so you don't have to explain it all... feel like I'm slowly catching up. Knowing about the receipts helps - in an ideal world where messages are processed within their original lease, should we see roughly a 1/1 modack/message ratio (one receipt modack per message)? What is an appropriate value for the max ack extension period? There was another thing I was looking into that was confusing me. I see this occasionally in thread dumps for subscriber executor threads:
Normally, I'd expect something like this:
@smartytime I'll answer out of order :) The stack trace looks like credentials being refreshed. Once in a while (I believe every hour), a gRPC connection needs to refresh credentials; the credentials are then cached and usable for another hour, etc. By default, we open a few connections, so I think it makes sense that you're seeing this once in a while. Ideally, you'd observe … That said, we're not being that inefficient. We keep a distribution of how long you take to ack messages and periodically adjust the stream's deadline to the 99.9th percentile, so for the vast majority of messages you should get fewer than 2 modacks. Right now, if you set the max extension to 5 minutes, we might extend it to more than 5. This is a bug; if you can tolerate it, you can set it to 5 now. I'll work on a fix for this. Note that you should generally set the max extension "unreasonably high", since setting it too low will result in duplicate messages. This feature was actually meant as a stopgap to make sure messages aren't stuck forever if we forget to ack them.
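(For reference, a sketch of setting the max extension discussed here on the Subscriber builder; the 5-minute value mirrors the example above, the subscription and receiver arguments are placeholders, and in the client versions discussed in this thread the setter takes an org.threeten.bp.Duration.)

```java
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import org.threeten.bp.Duration;

public class MaxExtensionSketch {
  static Subscriber build(String subscription, MessageReceiver receiver) {
    // Stop extending a message's ack deadline after 5 minutes; anything still
    // unacked by then becomes eligible for redelivery, so keep this comfortably
    // above the longest expected processing time.
    return Subscriber.newBuilder(subscription, receiver)
        .setMaxAckExtensionPeriod(Duration.ofMinutes(5))
        .build();
  }
}
```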
I found this: …
@alex-kisialiou-sciencesoft You might find this useful: https://developers.google.com/api-client-library/java/apis/pubsub/v1, but you could try the lower-level auto-generated client for gRPC as well (see the synchronous pull example https://cloud.google.com/pubsub/docs/pull#pubsub-pull-messages-sync-java).
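(A condensed sketch along the lines of that synchronous pull example, using the generated SubscriberStub instead of the streaming Subscriber; the project/subscription names and batch size are placeholders.)

```java
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.AcknowledgeRequest;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PullRequest;
import com.google.pubsub.v1.PullResponse;
import com.google.pubsub.v1.ReceivedMessage;
import java.util.ArrayList;
import java.util.List;

public class SyncPullSketch {
  public static void main(String[] args) throws Exception {
    String subscription =
        ProjectSubscriptionName.format("my-project", "my-subscription"); // placeholders

    SubscriberStubSettings settings = SubscriberStubSettings.newBuilder().build();
    try (SubscriberStub stub = GrpcSubscriberStub.create(settings)) {
      // Pull a bounded batch instead of using streamingPull.
      PullRequest pullRequest =
          PullRequest.newBuilder()
              .setSubscription(subscription)
              .setMaxMessages(100)
              .build();
      PullResponse response = stub.pullCallable().call(pullRequest);

      List<String> ackIds = new ArrayList<>();
      for (ReceivedMessage received : response.getReceivedMessagesList()) {
        // process received.getMessage() here, then collect its ack ID
        ackIds.add(received.getAckId());
      }

      if (!ackIds.isEmpty()) {
        stub.acknowledgeCallable()
            .call(
                AcknowledgeRequest.newBuilder()
                    .setSubscription(subscription)
                    .addAllAckIds(ackIds)
                    .build());
      }
    }
  }
}
```

Synchronous pull trades some throughput for explicit control over when messages are requested and acked, which is why it comes up below as a workaround for backlogs of small messages.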
We ran into this issue on our GCS Object Change Notification subscribers (https://cloud.google.com/storage/docs/object-change-notification, very small messages) and repeated it in a test program (https://github.com/yonran/pubsubfallingbehindbug/). After upgrading google-cloud-java from before 0.21.1-beta to after 0.42.0-beta, the subscriber works fine until it falls behind. Once it falls behind, the PubSub server starts sending duplicates about every hour, so that over half of the messages are duplicates. The Subscriber can easily fall further and further behind. Therefore, the Subscriber either needs to process messages very quickly, or it needs to cache several hours of message ids to ack duplicates quickly. After contacting GCP support, they told me that this is a known bug with StreamingPull of small messages (https://cloud.google.com/pubsub/docs/pull#dealing-with-large-backlogs-of-small-messages), and that a workaround is to downgrade to google-cloud-pubsub 0.21.1-beta. While we wait for a fix to the StreamingPull server API, would it be possible for the google-cloud-pubsub library to offer an option to use the Pull API instead of StreamingPull?
Yonatan, you can find an example of using a synchronous pull here: https://cloud.google.com/pubsub/docs/pull. That said, you say that messages get re-delivered every hour. This suggests a pattern we have not considered. Might you say why you take more than an hour to acknowledge a message once you get it?
--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub
<https://cloud.google.com/pubsub/overview>
@kir-titievsky, thank you for the synchronous pull sample using the Pull gRPC API directly. However, we have written a number of …
My client-side code acknowledges all messages in a matter of seconds (in my RawPubSub.java version, I …).
That sounds like a bug somewhere. The re-delivery after an hour, in particular, makes me suspicious. If you could, might you file a separate bug with a reproduction? If not, … An alternative explanation for this behavior is that the acks never succeed, which might make this a client-side bug. But it's hard to tell.
We have experienced a similar kind of issue here with the Java google-cloud-pubsub lib GA version 1.31.0. We don't get duplicate messages, but the messages seem stuck in the queue even though we send acks back. After restarting the clients, the stuck messages got cleared up.
@luann-trend What does "stuck" here mean? Are new messages not being processed?
@pongad New messages are still being processed; it's just that a couple hundred messages keep getting redelivered and can't be processed for some reason. We have experienced the same issue on 2 different cluster environments, after about 4-5 days of using the new Java Google Pub/Sub client GA version.
@luann-trend This is interesting. Is it possible that the messages are causing you to throw an exception? We catch exceptions and nack messages automatically, assuming that the user code failed to process them. Do you know the duration of time between redeliveries?
Should have been fixed in #3743
A large number of duplicate messages is observed with the Pub/Sub streamingPull client library, using code that pulls from a Pub/Sub subscription and inserts messages into BigQuery with a synchronous blocking operation, with flow control set to a max of 500 outstanding messages. See [1] for code.
For the same code, we also observe an excessive number of modifyAckDeadline operations (>> streamingPull message operations). Tracing a single message, we see modifyAcks and Acks in alternating order for the same message (modAck, modAck, Ack, modAck, Ack) [2]. This suggests that the implementation might fail to remove Ack'ed messages from the queue of messages to process and keeps re-processing messages already on the client. It also suggests that ack requests may not actually be sent.
[2] https://docs.google.com/spreadsheets/d/1mqtxxm0guZcOcRy8ORG0ri787XLQZNF_FLBujAiayFI/edit?ts=59cac0f0#gid=2139642597
[1]