
fix: Prevent redis task loss when closing connection while in poll #1733

Merged (2 commits) on Aug 15, 2023

Conversation

mbierma (Contributor) commented on May 26, 2023

Summary of issue

The redis poller uses the BRPOP command to pull items from the queue:

BRPOP is a blocking list pop primitive. It is the blocking version of RPOP because it blocks the connection when there are no elements to pop from any of the given lists. An element is popped from the tail of the first list that is non-empty, with the given keys being checked in the order that they are given.

If the channel is closed before the BRPOP response is read, there is a risk that a task will be popped from the queue, but never picked up or restored:

```mermaid
sequenceDiagram
    Worker->>Redis: Worker issues BRPOP command to redis, with N second timeout
    Worker->>Redis: Worker receives warm shutdown request, and closes connection
    Client->>Redis: Client adds task to the task queue
    Redis->>Redis: BRPOP removes task from queue, but fails to respond to worker (connection is closed)
    Worker->>Worker: Worker terminates, but client's task is lost
```
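
To make the blocking behaviour concrete, here is a minimal redis-py sketch of the kind of blocking pop the worker performs (illustrative only: it assumes a locally running Redis, and the queue name and timeout are arbitrary rather than kombu's actual values):

```python
# Minimal redis-py sketch of a blocking pop with a timeout.
# Queue name and timeout are illustrative, not kombu's actual values.
import redis

r = redis.Redis()

# Blocks for up to 5 seconds waiting for an element on "celery".
# Returns None on timeout, or a (queue_name, payload) tuple once an
# element has been popped -- at which point the element is already gone
# from the Redis list, so losing the reply means losing the task.
item = r.brpop("celery", timeout=5)
if item is not None:
    queue_name, payload = item
    print(queue_name, payload)
```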

Summary of change

If the channel is in the _in_poll state when it is closed (i.e. waiting for the BRPOP command to return), attempt to read the BRPOP response before closing the channel. This should allow the task to be restored before the channel is closed.
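
As a simplified, self-contained sketch of the idea (not the actual kombu code: PollingChannel, read_reply, restore and disconnect are hypothetical stand-ins for kombu's Redis transport internals):

```python
# Simplified sketch of the idea behind the change; not the actual kombu code.
# PollingChannel, transport.read_reply and transport.restore are hypothetical.
from queue import Empty


class PollingChannel:
    def __init__(self, transport):
        self.transport = transport
        self._in_poll = False
        self._closing = False

    def _brpop_read(self):
        # In kombu this reads the pending BRPOP reply and hands the message
        # to the normal delivery/restore machinery; here it is a placeholder.
        reply = self.transport.read_reply()
        if reply is None:
            raise Empty()
        self.transport.restore(reply)

    def close(self):
        self._closing = True
        if self._in_poll:
            try:
                # Drain the outstanding BRPOP reply before tearing down the
                # connection, so a task popped while shutting down is restored
                # rather than silently dropped.
                self._brpop_read()
            except Empty:
                pass
        self.transport.disconnect()
```

The actual patch applies the same drain step inside the Redis transport's Channel.close(), as shown in the diff excerpt further down.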

Potentially related issues

auvipy self-requested a review, May 27, 2023 12:21
auvipy (Member) commented on May 27, 2023

@pawl can you try this?

pawl (Contributor) commented on May 29, 2023

@auvipy You mean test this as a fix for celery/celery#7276?

I tried testing this fix on this branch of a minimal celery example: https://github.com/pawl/celery_pyamqp_memory_leak/tree/test_1733

I'm still seeing the issue with redis connection errors eventually causing it to remove the wrong function from on_tick.

We're probably still going to need a fix like this: #1734

mbierma (Contributor, Author) commented on May 29, 2023

Yeah, I think #1733 and #1734 are addressing different issues with redis connections.

auvipy added this to the 5.3 milestone, May 30, 2023
@@ -1079,6 +1079,11 @@ def _purge(self, queue):

    def close(self):
        self._closing = True
        if self._in_poll:
Review comment (Member):
can we try some integration tests for the changes as well? https://github.com/celery/kombu/blob/main/t/integration/test_redis.py

dingxiong commented on Oct 2, 2023

Hi @mbierma @auvipy,

Thanks for fixing this issue. But it looks like this patch does not fix all possible edge cases. Adding a self._brpop_read() step in the close() function helps prevent losing tasks during warm shutdown, but what I observed is that it still drops a task if I send a kill -9 after the warm shutdown. The close function won't be executed if the worker controller is forcefully killed.

Do you have any idea what to do in this case?

thuibr commented on Oct 21, 2024

Hi @mbierma, I am trying to debug #1785. Pardon my ignorance, but why does attempting to read the response before closing the channel allow the task to be restored before the channel is closed?

If the channel is in the _in_poll state when it is closed (i.e. waiting for the BRPOP command to return), attempt to read the BRPOP response before closing the channel. This should allow the task to be restored before the channel is closed.

auvipy (Member) commented on Oct 23, 2024

You are welcome to come up with a better fix which handles the other edge cases as well. Contributions are welcome.
