Pubsub messages always expire after 30 minutes #2622

hairybreeches · 2017-11-14T15:38:46Z

We have the following setup for processing pubsub messages (some of which can require a large amount of processing - up to 2 or 3 hours!):

A pubsub subscription with an acknowledgement deadline of 600s/10 minutes-
A maximum acknowledgement deadline (including extensions) of 8 hours, set using setMaxAckExtensionPeriod on the Subscriber.Builder
A policy to renew the period of the message 5 minutes before expiry, set using setAckExpirationPadding on the Subscriber.Builder

Under these circumstances, I would expect:

The subscriber to accept the message from the queue, with an initial deadline of 10 minutes.
The subscriber to renew the acknowledgement deadline for another 10 minutes every 5 minutes, until:

the process function acks/nacks the message
10 minutes after the process function fails, and the expiry deadline (no longer being renewed) expires
after 8 hours of extensions, the max deadline is reached and the deadline can no longer be extended

What we actually see - the message goes back on the queue after 30 minutes. Tried processing a long-running message three times, every single time the message gets picked back up off the queue by a worker 30 minutes later. I just can't understand why this would be the case. We never mention 30 minutes in our settings, and I can't see 30 minutes in the defaults anywhere (for example the default max extension period is 60 minutes).

It's entirely possible I've completely misunderstood the way the acknowledgement deadline renewal is supposed to work and am barking up the wrong tree entirely, if so deepest apologies, but I'm at my wits' end trying to understand what's going on here!

The text was updated successfully, but these errors were encountered:

michaelbausor · 2017-11-14T17:46:46Z

@pongad Do you know what might be going on here?

pongad · 2017-11-14T23:59:06Z

@hairybreeches You seem to understand most things correctly (see 1a below). There are 2 known issues that might explain this behavior.

The latest release has a bug where if you set the padding to >= 2 seconds, it can cause us to try to send many modify deadline requests quickly. Modifying so often could cause the pubsub server to misbehave. This was fixed by BREAKING pubsub: fix and simplify deadline extension #2604.

One caveat: The linked PR removed the ability for you to set padding time. It is currently set to 5 seconds, which should be more than enough for the request to reach the server. The padding config led to quite a few bugs in the past. In my opinion, it is better to make sure the preset value works well. Please let us know if you think this should work for you.

1a. You said in the issue that you set the deadline 10 minutes. This configures the deadline for messages retrieved using the old "polling pull" endpoint. Pubsub team has asked us to move to "streaming pull". In this new endpoint, the deadline is configured per-connection. Theoretically, you can have two machines pulling messages from the same topic, but one of them sets deadline to 1m and the other to 10m. Currently the deadline is set to 1m (with 5s padding, we extend every 55s). Later, we will make the client lib records how long you take to process messages and adjust accordingly. After this, the lib should automatically optimize the deadline without you having to configure things. In the meantime, this will slightly increase your network usage (we send modify deadlines more often than ideal) but this shouldn't greatly matter: the modify deadline messages are quite small and you don't get billed for it.

The pubsub service has a duplication problem on the server tracked in Pub/Sub streamingPull subscriber: large number of duplicate messages, modifyAckDeadline calls observed #2465. The fix for this is being rolled out this week.

Compared to other issues reported, your workload is much longer-running, so it could be running into an undiagnosed problem. Let's see if the fix for 1 and 2 help first, but I'll keep this issue open for now.

hairybreeches · 2017-11-15T08:28:28Z

@pongad

Thanks for the prompt, detailed, helpful reply.

Agree 100% with more of the timings being calculated by the lib - both the pull deadline and the padding time. It'd be good not to have to think about either too hard and I don't see much need for customisation of either in our case. I think you're right that 5 seconds padding should be more than enough for anyone - ours is only so long because we were worried our issue might be caused by insufficient padding, and so we set it to something outrageously large (in the end, it made no difference whether it was 500ms or 5 minutes).

I'm also really glad to hear about issue number 2 - this explains behaviour we are seeing both in integration tests, and when the message is requeued. Every half an hour, the message is picked back up off the queue by the worker, more and more times every half an hour.

So in order to fix issue 1, I'll change our padding back down to 1s, and I'll wait for the fix for issue 2 to be rolled out to the servers, then I'll re-test this and let you know the result.

edit: It looks like I'll also have to update the client library to v0.29 to take advantage of the fix on the server for issue 2. I'll do this too.

hairybreeches · 2017-11-15T10:27:32Z

Is there any way to find out about pubsub server updates? The only thing I can find is this, which only seems to include large feature/availability rollouts https://cloud.google.com/pubsub/docs/release-notes

pongad · 2017-11-15T22:24:09Z

@hairybreeches The pubsub team is rolling out the update. Currently it's only on some servers though; I'll keep the issue threads posted.

pongad · 2017-11-22T00:16:13Z

@hairybreeches The fix has landed. Could you try again and see if it helps? v0.30.0 is the latest release.

hairybreeches · 2017-11-22T11:50:17Z

@pongad - we've updated to v0.30.0, and queued a long-running job this morning, and I'm afraid that both the message duplication and requeuing issues seem to persist.

This is the log of workers picking up one particular message from the queue (note the filter on message ID, message post time). It's picked up once, then half an hour later another two times, then half an hour later again another four times. The bit after "meters-workers-" corresponds to the k8s pod (one running process/Subscriber per pod), so the duplicates are both across pods/processes and within one pod/process.

The really interesting thing is that, looking back through the logs at the previous attempt to execute this long-running job, it looks like it never actually finished - either with success or failure. This suggests that perhaps the requeue of the message is due to the job being killed or failing in some way. It's possible that this is down to a failure in our code, or the process running out of resources, but we suspect not - we receive no failure logs from our all-encompassing catch block, and the regularity of the failure (the half-hour period remains the same for other long-running jobs) suggests this is some sort of timeout issue.

pongad · 2017-11-23T03:02:41Z

@hairybreeches Thank you for letting us know. I've reported to Pubsub team.

hairybreeches · 2017-11-23T07:50:39Z

@pongad Thanks! This is pretty important to us, so if there's anything diagnostic we can do, we'd be happy to help

hairybreeches · 2017-11-23T11:32:07Z

Ah, it looks like the updates that we tested on 22/11 (server updates and update to v0.30.0) meant we could finish the work. Although the message expiry/duplication still affected us, the work did finish eventually: It was picked up 15 times in total, by different worker threads across three pods, and finished each time:

kir-titievsky · 2017-11-27T15:19:54Z

@hairybreeches You mention "long running jobs" -- in general, Pub/Sub and the examples are not geared towards jobs that take a "short" time (call it under 10 minutes). You can have much longer jobs, but they might take some fiddling with the settings and generally work better with a synchronous API. I'll try to contact you on linked into see if we could look at this more closely and find the right way of managing these jobs.

hairybreeches · 2017-11-27T15:50:30Z

@kir-titievsky my email address is [deleted]

pongad · 2018-01-08T00:52:58Z

I synced up with Kir a while back and this seems to have been resolved. Please reopen if required.

hairybreeches · 2018-01-08T08:23:25Z

@pongad I wasn't aware that this issue had been fixed - has it been fixed now or are you saying it's not a priority to be fixed right now?

pongad · 2018-01-08T23:16:32Z

I must have misunderstood Kir earlier. I'll reach back out to make sure.

kir-titievsky · 2018-01-09T17:17:00Z

So, I think the root of the issue is that streamingPull connections get closed after something like 30 seconds of inactivity and can't survive for longer than 30 minutes. Re-building the connections will lead to elevated duplicate message delivery. The fixes to mitigate that server-side are rolling out in the next couple weeks.

This has two possible implications for the client library:

If we had exposed the parameter that controls the maximum delay between modAcks being sent to the users, we could have had a very simple mitigation for the above condition (e.g. send mod acks no less frequently than every 25 seconds, rather than the 99th ptile).
There is a case for keeping pull based implementation around, particularly, if it's still in the code. Having that as a user configurable fall back might be an easy workaround for tasks that require more than 30 minutes.

kir-titievsky · 2018-07-18T20:39:15Z

@pongad, @danoscarmike -- I think this might be obsolete by now.

danoscarmike · 2018-07-19T00:01:40Z

@hairybreeches I'm going to close this one out but please reply if you are still experiencing the issue.

angadn · 2018-07-24T10:13:52Z

Sorry to reopen this discussion - does this mean https://godoc.org/cloud.google.com/go/pubsub#ReceiveSettings has a now redundant comment, and that 30 minutes is not effectively the ceiling for this value anymore?

pongad · 2018-07-24T18:19:31Z

I think the very long running messages described there is still an issue. I'll open a separate bug for this.

angadn · 2018-07-24T18:44:59Z

Can confirm it is still there

…

Sent from my iPhone

On 24-Jul-2018, at 11:49 PM, Michael Darakananda ***@***.***> wrote: I think the very long running messages described there is still an issue. I'll open a separate bug for this. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

hassa687 · 2018-11-13T19:28:13Z

This issue still persists in v 1.50.0

kir-titievsky · 2018-11-15T00:13:50Z

Hassan: I think there is now configurable “maximum extension duration.” Might you be using the default value?

On Tue, Nov 13, 2018 at 11:28 AM Hassan A ***@***.***> wrote: This issue still persists in v 1.50.0 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2622 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ARrMFiYJ1Die_0EcCOC6VWzN_ikWEUAhks5uux1WgaJpZM4QdiH0> .

-- Kir Titievsky | Product Manager | Google Cloud Pub/Sub <https://cloud.google.com/pubsub/overview>

michaelbausor added api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue. labels Nov 14, 2017

michaelbausor assigned pongad Nov 14, 2017

pongad closed this as completed Jan 8, 2018

pongad reopened this Jan 9, 2018

danoscarmike added the triaged for GA label Jan 11, 2018

pongad added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Jan 17, 2018

pongad added status: blocked Resolving the issue is dependent on other work. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jan 30, 2018

JustinBeckwith added the 🚨 This issue needs some love. label Jun 8, 2018

danoscarmike closed this as completed Jul 19, 2018

theacodes unassigned pongad Sep 28, 2018

JustinBeckwith assigned pongad Feb 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pubsub messages always expire after 30 minutes #2622

Pubsub messages always expire after 30 minutes #2622

hairybreeches commented Nov 14, 2017

michaelbausor commented Nov 14, 2017

pongad commented Nov 14, 2017

hairybreeches commented Nov 15, 2017 •

edited

Loading

hairybreeches commented Nov 15, 2017

pongad commented Nov 15, 2017

pongad commented Nov 22, 2017

hairybreeches commented Nov 22, 2017

pongad commented Nov 23, 2017

hairybreeches commented Nov 23, 2017

hairybreeches commented Nov 23, 2017

kir-titievsky commented Nov 27, 2017

hairybreeches commented Nov 27, 2017 •

edited

Loading

pongad commented Jan 8, 2018

hairybreeches commented Jan 8, 2018

pongad commented Jan 8, 2018

kir-titievsky commented Jan 9, 2018

kir-titievsky commented Jul 18, 2018

danoscarmike commented Jul 19, 2018

angadn commented Jul 24, 2018

pongad commented Jul 24, 2018

angadn commented Jul 24, 2018 via email

hassa687 commented Nov 13, 2018

kir-titievsky commented Nov 15, 2018 via email

Pubsub messages always expire after 30 minutes #2622

Pubsub messages always expire after 30 minutes #2622

Comments

hairybreeches commented Nov 14, 2017

michaelbausor commented Nov 14, 2017

pongad commented Nov 14, 2017

hairybreeches commented Nov 15, 2017 • edited Loading

hairybreeches commented Nov 15, 2017

pongad commented Nov 15, 2017

pongad commented Nov 22, 2017

hairybreeches commented Nov 22, 2017

pongad commented Nov 23, 2017

hairybreeches commented Nov 23, 2017

hairybreeches commented Nov 23, 2017

kir-titievsky commented Nov 27, 2017

hairybreeches commented Nov 27, 2017 • edited Loading

pongad commented Jan 8, 2018

hairybreeches commented Jan 8, 2018

pongad commented Jan 8, 2018

kir-titievsky commented Jan 9, 2018

kir-titievsky commented Jul 18, 2018

danoscarmike commented Jul 19, 2018

angadn commented Jul 24, 2018

pongad commented Jul 24, 2018

angadn commented Jul 24, 2018 via email

hassa687 commented Nov 13, 2018

kir-titievsky commented Nov 15, 2018 via email

hairybreeches commented Nov 15, 2017 •

edited

Loading

hairybreeches commented Nov 27, 2017 •

edited

Loading