[BUG] Module Fork Dead Lock #766

keithchew · 2023-12-30T05:12:36Z

Working on v6.3.4, I have a test module which calls RedisModule_Fork() periodically. I encountered a condition where KeyDB completely freezes, and the logs have stopped logged. Adding some logging, I tracked it down to redisFork() in server.cpp:

    serverLog(LL_WARNING, "aeAcquireForkLock start");
    aeAcquireForkLock();
    serverLog(LL_WARNING, "aeAcquireForkLock finish");

when it freezes, "aeAcquireForkLock start" is printed but not the corresponding finish. Looking into this method, I can see a comment in the code:

            if (exclusive) {
                /* Another thread might have the write lock while we have the internal lock
                but won't be able to release it until they can acquire the internal lock
                so release the internal lock and try again instead of waiting to avoid deadlock */
                while(!m_writeLock.try_lock())
                    m_cv.wait(rm);
            }

The comment says it should release the internal lock and try again, but it looks like it is only trying again without any release?

I am still tracking this down, but if anyone has experienced the same issue, or has any advice on this, happy to take any suggestions to try out.

Note: I have Save "" in the config, so there is no background saving happening. From the logs, I cannot see other places that is trying to call the fork, so this deadlock scenario is a bit strange.

The text was updated successfully, but these errors were encountered:

keithchew · 2024-01-01T00:59:34Z

I managed to trigger another dead lock, this time with logging and stack trace:

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/testmodule.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/testmodule.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/testmodule.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]
release exclusive [1] m_writeCount [0]



!!! ERROR: Deadlock detected !!!
        52: (0x7f0fa1e86d70) RocksDBStorageProvider
        70: (0x7f0fa1e3111c) StorageCache
printing backtrace for thread 52
!!! KeyDB Will Now Crash !!!


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
7:52:M 31 Dec 2023 19:27:05.020 # ------------------------------------------------
7:52:M 31 Dec 2023 19:27:05.020 # !!! Software Failure. Press left mouse button to continue
7:52:M 31 Dec 2023 19:27:05.020 # Guru Meditation: Deadlock detected #fastlock.cpp:268

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(fastlock_sleep+0x49f) [0x55fd8943fe6f]
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/testmodule.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/testmodule.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/testmodule.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]

------ INFO OUTPUT ------

It looks like it is coming from this line when the fork happens:

            g_pserver->stat_current_save_keys_total = dbTotalServerKeyCount();

Not sure how to workaround this, if anyone has any input, that'll be much appreciated.

keithchew · 2024-01-08T23:00:00Z

OK, I finally tracked it down.

My test module uses the RedisModule_CreateTimer() for periodic callbacks, and when the main thread in KeyDB calls the callback, the module uses another thread to call RedisModule_Fork(). To ensure there is no dead lock in KeyDB, I had had to do 2 things:

- In module, call RedisModule_ThreadSafeContextLock() and RedisModule_ThreadSafeContextUnlock() before/after RedisModule_Fork()
- In KeyDB, skip aeAcquireForkLock() call

I believe some of the standard modules in Redis use the same pattern as the test module, so it can be a gotcha for anyone using Redis modules within KeyDB. If you experience a complete freeze on KeyDB, then likely you have struck this bug. It took over 35,000 calls to aeAcquireForkLock (10-12 days) to trigger the dead lock for me, so not something that is triggered quickly.

Will keep testing and post any updates here.

keithchew · 2024-09-18T06:21:18Z

Although the above fixes the deadlock for the scenario above, there is another case where the same deadlock happens:

#865

Note that the proposed fix there might fix this issue.

keithchew · 2024-11-17T13:23:48Z

I had more time to review the code recently, and I have finally tracked this down (with the help of strace and gdb). The freeze is due to a deadlock condition between the server thread and module thread:

- server thread acquired g_forkLock read lock
- module tread acquired s_moduleGIL write lock by moduleAcquireGIL(true) in ThreadSafeContextLock()
- server thread try to acquire s_moduleGIL read lock by moduleAcquireGIL(true)
- module thread try to acquire g_forkLock write lock by aeAcquireForkLock() in redisFork()
- deadlock condition as server thread holds g_forkLock read lock and module thread holds s_moduleGIL write lock!

I have a workaround on my local:

make module thread try to acquire the g_forkLock write lock instead of blocking, and handle the condition when the lock cannot be acquired

Hope this helps anyone using modules and having keydb occasionally locking up.

keithchew · 2024-11-18T06:13:01Z

I wasn't too happy with the try lock above for g_forkLock, since this can cause the module to miss out fork events. I am now experimenting with using aeAcquireLock() instead of aeAcquireForkLock(). So far, no deadlocks.

Looking at the history of fork lock, it was introduced in 6.3.0 to address rdb background saving issues:
#378
#386
#391

That is some significant amount of code just to do a fork! Not sure why the global lock was not used since fork() is a cheap and fast operation. So for now, just going to use aeAcquireLock() for my module and for other forks, the existing code is used.

keithchew mentioned this issue Dec 3, 2024

[BUG] KeyDB deadlock #883

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Module Fork Dead Lock #766

[BUG] Module Fork Dead Lock #766

keithchew commented Dec 30, 2023

keithchew commented Jan 1, 2024 •

edited

Loading

keithchew commented Jan 8, 2024

keithchew commented Sep 18, 2024

keithchew commented Nov 17, 2024 •

edited

Loading

keithchew commented Nov 18, 2024

[BUG] Module Fork Dead Lock #766

[BUG] Module Fork Dead Lock #766

Comments

keithchew commented Dec 30, 2023

keithchew commented Jan 1, 2024 • edited Loading

keithchew commented Jan 8, 2024

keithchew commented Sep 18, 2024

keithchew commented Nov 17, 2024 • edited Loading

keithchew commented Nov 18, 2024

keithchew commented Jan 1, 2024 •

edited

Loading

keithchew commented Nov 17, 2024 •

edited

Loading