
[BUG] Module Fork Dead Lock #766

Open

keithchew opened this issue Dec 30, 2023 · 5 comments

Working with v6.3.4, I have a test module that calls RedisModule_Fork() periodically. I encountered a condition where KeyDB completely freezes and the logs stop being written. After adding some logging, I tracked it down to redisFork() in server.cpp:

    serverLog(LL_WARNING, "aeAcquireForkLock start");
    aeAcquireForkLock();
    serverLog(LL_WARNING, "aeAcquireForkLock finish");

When it freezes, "aeAcquireForkLock start" is printed but the corresponding "finish" is not. Looking into this method, I can see a comment in the code:

            if (exclusive) {
                /* Another thread might have the write lock while we have the internal lock
                but won't be able to release it until they can acquire the internal lock
                so release the internal lock and try again instead of waiting to avoid deadlock */
                while(!m_writeLock.try_lock())
                    m_cv.wait(rm);
            }

The comment says it should release the internal lock and try again, but it looks like it only retries without releasing anything?
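For reference, a minimal standalone C++ sketch of that try-lock/wait pattern (assuming m_cv is a std::condition_variable and rm is a std::unique_lock over the internal mutex, which I have not verified against the actual KeyDB read/write lock code). Note that std::condition_variable::wait atomically releases the lock it is given while waiting and re-acquires it before returning, so the release the comment mentions may be happening inside the wait call:

    // Sketch only: stand-in for the exclusive path of the fork lock.
    #include <condition_variable>
    #include <mutex>

    struct ForkLockSketch {
        std::mutex m_internal;          // the "internal lock"
        std::mutex m_writeLock;         // the contended write lock
        std::condition_variable m_cv;

        void acquireExclusive() {
            std::unique_lock<std::mutex> rm(m_internal);
            // try_lock fails while another thread holds m_writeLock; wait()
            // releases rm (the internal lock) until notified, then
            // re-acquires it and the loop retries.
            while (!m_writeLock.try_lock())
                m_cv.wait(rm);
        }

        void releaseExclusive() {
            std::lock_guard<std::mutex> rm(m_internal);
            m_writeLock.unlock();
            m_cv.notify_all();          // wake waiters so they can retry try_lock
        }
    };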

I am still tracking this down, but if anyone has experienced the same issue or has any advice, I am happy to take suggestions to try out.

Note: I have save "" in the config, so there is no background saving happening. From the logs, I cannot see any other place that is trying to fork, so this deadlock scenario is a bit strange.

@keithchew commented Jan 1, 2024

I managed to trigger another deadlock, this time with logging and a stack trace:

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/testmodule.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/testmodule.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/testmodule.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]
release exclusive [1] m_writeCount [0]



!!! ERROR: Deadlock detected !!!
        52: (0x7f0fa1e86d70) RocksDBStorageProvider
        70: (0x7f0fa1e3111c) StorageCache
printing backtrace for thread 52
!!! KeyDB Will Now Crash !!!


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
7:52:M 31 Dec 2023 19:27:05.020 # ------------------------------------------------
7:52:M 31 Dec 2023 19:27:05.020 # !!! Software Failure. Press left mouse button to continue
7:52:M 31 Dec 2023 19:27:05.020 # Guru Meditation: Deadlock detected #fastlock.cpp:268

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(fastlock_sleep+0x49f) [0x55fd8943fe6f]
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/testmodule.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/testmodule.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/testmodule.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]

------ INFO OUTPUT ------

It looks like it is coming from this line when the fork happens:

            g_pserver->stat_current_save_keys_total = dbTotalServerKeyCount();

Not sure how to work around this; if anyone has any input, it would be much appreciated.

@keithchew

OK, I finally tracked it down.

My test module uses RedisModule_CreateTimer() for periodic callbacks, and when the main thread in KeyDB invokes the callback, the module uses another thread to call RedisModule_Fork(). To avoid the deadlock in KeyDB, I had to do two things (a sketch of the module-side pattern follows the list):

- In the module, call RedisModule_ThreadSafeContextLock() and RedisModule_ThreadSafeContextUnlock() before/after RedisModule_Fork()
- In KeyDB, skip the aeAcquireForkLock() call
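
Roughly, the module-side pattern looks like the sketch below. This is a hedged illustration built from the public module API, not the actual test module; the module name, timer period and callback names (TimerCallback, DoForkWork, ForkDone) are made up:

    // Illustrative module sketch: a timer callback hands the fork off to a
    // separate thread, which wraps RedisModule_Fork() in the thread-safe
    // context lock (the first workaround above).
    #include <thread>
    #include "redismodule.h"

    static void ForkDone(int exitcode, int bypid, void *user_data) {
        REDISMODULE_NOT_USED(exitcode);
        REDISMODULE_NOT_USED(bypid);
        REDISMODULE_NOT_USED(user_data);
    }

    static void DoForkWork() {
        RedisModuleCtx *ctx = RedisModule_GetThreadSafeContext(NULL);
        RedisModule_ThreadSafeContextLock(ctx);   // hold the GIL around the fork
        int child = RedisModule_Fork(ForkDone, NULL);
        if (child == 0) {
            /* child process: do the background work, then exit */
            RedisModule_ExitFromChild(0);
        }
        RedisModule_ThreadSafeContextUnlock(ctx);
        RedisModule_FreeThreadSafeContext(ctx);
    }

    static void TimerCallback(RedisModuleCtx *ctx, void *data) {
        REDISMODULE_NOT_USED(data);
        RedisModule_CreateTimer(ctx, 1000, TimerCallback, NULL);  // re-arm
        std::thread(DoForkWork).detach();         // fork from a non-main thread
    }

    extern "C" int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
        REDISMODULE_NOT_USED(argv);
        REDISMODULE_NOT_USED(argc);
        if (RedisModule_Init(ctx, "forktest", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
            return REDISMODULE_ERR;
        RedisModule_CreateTimer(ctx, 1000, TimerCallback, NULL);
        return REDISMODULE_OK;
    }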

I believe some of the standard Redis modules use the same pattern as the test module, so it can be a gotcha for anyone using Redis modules with KeyDB. If you experience a complete freeze of KeyDB, you have likely hit this bug. It took over 35,000 calls to aeAcquireForkLock (10-12 days) to trigger the deadlock for me, so it is not something that triggers quickly.

Will keep testing and post any updates here.

@keithchew

Although the above fixes the deadlock for that scenario, there is another case where the same deadlock happens:

#865

Note that the fix proposed there might also resolve this issue.

@keithchew commented Nov 17, 2024

I had more time to review the code recently, and I have finally tracked this down (with the help of strace and gdb). The freeze is due to a deadlock between the server thread and the module thread:

- server thread has acquired the g_forkLock read lock
- module thread has acquired the s_moduleGIL write lock via moduleAcquireGIL(true) in ThreadSafeContextLock()
- server thread tries to acquire the s_moduleGIL read lock via moduleAcquireGIL(true)
- module thread tries to acquire the g_forkLock write lock via aeAcquireForkLock() in redisFork()
- deadlock, since the server thread holds the g_forkLock read lock and the module thread holds the s_moduleGIL write lock (illustrated by the standalone sketch below)!
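
The circular wait can be reproduced in isolation. Below is a standalone sketch using std::shared_mutex objects as stand-ins for g_forkLock and s_moduleGIL (the real KeyDB locks are custom read/write locks, so this is only an analogy of the lock ordering, not KeyDB code):

    // Stand-alone analogy of the circular wait: two read/write locks taken
    // in opposite orders by two threads.
    #include <shared_mutex>
    #include <thread>

    std::shared_mutex forkLock;   // stand-in for g_forkLock
    std::shared_mutex moduleGIL;  // stand-in for s_moduleGIL

    void serverThread() {
        std::shared_lock<std::shared_mutex> fl(forkLock);    // step 1: read lock
        std::shared_lock<std::shared_mutex> gil(moduleGIL);  // step 3: blocks on the module thread's write lock
    }

    void moduleThread() {
        std::unique_lock<std::shared_mutex> gil(moduleGIL);  // step 2: write lock
        std::unique_lock<std::shared_mutex> fl(forkLock);    // step 4: blocks on the server thread's read lock
    }

    int main() {
        std::thread a(serverThread), b(moduleThread);
        a.join();
        b.join();  // with the wrong interleaving, neither thread makes progress
    }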

I have a workaround locally:

- make the module thread try-lock the g_forkLock write lock instead of blocking, and handle the case where the lock cannot be acquired (see the sketch after this list)
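
The shape of that workaround, using the same stand-in locks as in the sketch above (again only an analogy, not the actual KeyDB change):

    // The module thread try-locks the fork lock instead of blocking, and
    // backs off when it cannot get it, breaking the circular wait.
    #include <shared_mutex>

    extern std::shared_mutex forkLock;   // stand-in for g_forkLock (see above)

    bool tryModuleFork() {
        std::unique_lock<std::shared_mutex> fl(forkLock, std::try_to_lock);
        if (!fl.owns_lock()) {
            // Fork lock is busy (e.g. the server thread holds the read lock);
            // report failure to the caller instead of blocking and deadlocking.
            return false;
        }
        // ... fork() would happen here while the write lock is held ...
        return true;
    }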

Hope this helps anyone using modules and seeing KeyDB occasionally lock up.

@keithchew

I wasn't too happy with the try-lock above for g_forkLock, since it can cause the module to miss fork events. I am now experimenting with using aeAcquireLock() instead of aeAcquireForkLock(). So far, no deadlocks.

Looking at the history of the fork lock, it was introduced in 6.3.0 to address RDB background-saving issues:
#378
#386
#391

That is a significant amount of code just to do a fork! Not sure why the global lock was not used, since fork() is a cheap and fast operation. So for now, I am just going to use aeAcquireLock() for my module's fork, and the existing code is used for the other forks.
