
[BUG] Background saving not working with 6.2.1 #378

Closed
Talkabout opened this issue Nov 24, 2021 · 34 comments

Comments

@Talkabout

Describe the bug

Background saving no longer seems to work after updating from 6.0.18 to 6.2.1.

To reproduce

Not sure how to reproduce it. I updated to the new version and restarted both servers (active-active replicas). After the initial sync there has been no background saving at all (for 3 days now). Before the update this task was executed every 5 minutes...

image

image

Expected behavior

Background saving should work :)

Additional information

Using 2 Raspberry Pis with active-active replication.

@Talkabout
Author

As additional information, these are the settings in the conf file I am using for background saving:

image

@VivekSainiEQ
Contributor

Hi @Talkabout,

Unfortunately I cannot seem to replicate this on my end. What do your config files look like?

@Talkabout
Author

Hi @VivekSainiEQ,

these are my settings:

bind 192.168.XX.XX
protected-mode no
port 6379
timeout 0
tcp-keepalive 0
daemonize no
supervised systemd
pidfile /var/run/redis/redis-server.pid
loglevel notice
syslog-enabled yes
syslog-ident keydb
databases 4
always-show-logo no
save 900 1
save 300 100
save 60 10000
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum no
dbfilename dump.rdb
dir /var/lib/redis
repl-diskless-sync no
replica-priority 100
maxmemory 512M
maxmemory-policy allkeys-lru
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 16
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
active-replica yes
replicaof 192.168.XX.XX 6379
server-threads 2

As you can see, I have migrated from Redis and am using the same config file with specific KeyDB options added.

thanks for your help!

Bye

@sts

sts commented Dec 14, 2021

@VivekSainiEQ

We've got the very same issue with KeyDB. I reported it to the community forum two weeks ago: https://community.keydb.dev/t/keydb-rdb-bgsave-never-competes/189

Is there anything else we can provide to debug the issue?

@Talkabout
Author

Thanks @sts. I killed the running processes and the background save took place immediately. Subsequent background saves are also working as of now, but I assume that at some point it will stop again (the background process will hang). It would be nice to have somebody look into that.

Thanks!

@Talkabout
Author

And now one of the KeyDB servers has a hanging background save process again:

image

The other one is still executing background saves...

@Talkabout
Author

And here we have the second hanging process:

image

There is surely an issue somewhere in the new version of KeyDB. Do you need any other information to analyze it?

@VivekSainiEQ
Contributor

Hi @Talkabout and @sts,

Do BGSAVEs always hang, or can they run multiple times before hanging? If so, how many runs does it take before one hangs? I suspect there is a bug in how systemd supervision interacts with the BGSAVE mechanism.

@Talkabout
Author

Talkabout commented Dec 17, 2021

Hi @VivekSainiEQ,

The first server executed 7 background saves before it got stuck; the second one executed 18. I have now disabled systemd supervision in my config file and restarted both KeyDB instances. Let's see if it helps.
Thanks for taking a look at the issue!

@Talkabout
Author

At the moment background saving runs without problems:

image

Will keep you guys posted about status.

@Talkabout
Author

Unfortunately:

image

even after removing supervised systemd from the config. Any other ideas?

@Talkabout
Author

Another update, both servers are stuck now:

image

@esatterwhite

@Talkabout For the sake of clarity and to narrow down the problem, does this happen on 6.2.0?

@Talkabout
Author

Hi @esatterwhite,

I have not used 6.2.0 because it was causing major memory issues on my system; I switched directly from 6.0.16 to 6.2.1...

@MalavanEQAlpha
Contributor

Hi @Talkabout, @sts, @esatterwhite, it turns out this issue has always been present, but 6.2.1 made it dramatically more likely (so 6.2.0 is as safe as 6.0.18).

The localtime_r function internally requires a lock, so any multithreaded program that forks when another thread is in the middle of a call to localtime_r will hang when the forked process calls localtime_r.

Prior to 6.2.1 that was extremely unlikely to happen, but in 6.2.1 we added a thread dedicated to checking the time. This makes repeated calls to localtime_r, massively increasing the chance that one is in flight when we fork for a background save.

The only call to localtime_r in the background save is within the syslog() call, so disabling syslog (by setting syslog-enabled no in the config) should solve your issue for now while we work on a more complete fix.
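
For illustration only, here is a minimal standalone C sketch of the hazard described above (not KeyDB code; the program and helper names are made up). One thread calls localtime_r in a loop, mimicking the dedicated time-checking thread, while the main thread forks; if the fork lands while that thread holds glibc's internal time-zone lock, the child inherits the lock in its locked state and its own localtime_r call can block forever:

/* Hypothetical repro sketch; build with: gcc -pthread repro.c -o repro */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void *time_worker(void *arg)
{
    (void)arg;
    struct tm tm;
    for (;;) {                       /* mimics a thread that checks the time constantly */
        time_t now = time(NULL);
        localtime_r(&now, &tm);      /* grabs an internal glibc lock on each call */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, time_worker, NULL);

    for (int i = 0; i < 1000; i++) {
        pid_t pid = fork();          /* analogous to the fork done for a background save */
        if (pid == 0) {
            struct tm tm;
            time_t now = time(NULL);
            localtime_r(&now, &tm);  /* may never return if the lock was held at fork time */
            _exit(0);
        }
        int status;
        waitpid(pid, &status, 0);    /* hangs here whenever a child got stuck */
    }
    puts("no hang observed in this run");
    return 0;
}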

@Talkabout
Author

Hi @MalavanEQAlpha,

I have disabled syslog on my 2 servers and restarted KeyDB. At the moment background saving works; I will report again tomorrow on whether the issue is fixed.
Thanks for looking into this!

@Talkabout
Author

Good news!

image

It seems you hit the nail on the head, @MalavanEQAlpha. Thanks again, waiting for the fix!

@benschermel
Collaborator

Thanks @MalavanEQAlpha, @Talkabout, and all on this thread. This issue should be resolved with the 6.2.2 release (PR #384). Closing this issue.

@Talkabout
Author

Hi,

thanks for the information and the fix!

image

Looks good so far. Will report back if anything changes.

Bye

@Talkabout
Author

Hi,

unfortunately the issue does not seem to be fixed yet:

image

Anything I can provide for further analysis?

@esatterwhite

Even with syslog disabled?

@Talkabout
Author

I had syslog enabled during the above test. I have now disabled it again and will report tomorrow whether the problem also occurs with it disabled.

@esatterwhite

What is this UI you are using?

@esatterwhite

Is there anything in the replication logic that needs a quorum? I wonder if it's related to the fact that you have an even number of servers. Total shot in the dark.

@Talkabout
Author

What is this UI you are using?

This is the phpRedisAdmin tool.

Is there anything in the replication logic that needs a quorum? I wonder if it's related to the fact that you have an even number of servers. Total shot in the dark.

Not sure what to say here. I have 2 servers to ensure a fallback if one of them crashes. As the issue was not there in 6.0.18, I don't think it has anything to do with my setup, which didn't change.

@Talkabout
Author

image

Without syslog enabled, the saving seems to work again. So there is still a bug somewhere in the syslog handling.

@MalavanEQAlpha
Contributor

Hi @Talkabout,
I am unable to reproduce the issue on version 6.2.2. Can you please provide more details about how you are running KeyDB?
What operating system, how you acquired KeyDB (binary/docker/git), details of the machine/VM/docker image running KeyDB, etc.
If possible, a stack trace of the hanging process would be helpful as well; you can find instructions on how to do that with gdb here: https://sourceware.org/gdb/onlinedocs/gdb/Attach.html
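
In case it helps, a rough sketch of how such a trace could be collected (the process name and PID below are placeholders; the stuck fork usually shows up as a separate keydb-server child process):

# find the PID of the stuck background-save child (names/output may differ)
ps aux | grep keydb

# attach gdb, dump backtraces of all threads, then detach
sudo gdb -p <PID>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit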

@Talkabout
Author

Hi,

Operating System: Debian 10 (Buster) on a Raspberry Pi
KeyDB: built from source acquired from GitHub
Build steps:

sudo apt-get install -y build-essential nasm autotools-dev autoconf libjemalloc-dev tcl tcl-dev uuid-dev libcurl4-openssl-dev pkg-config && \
make distclean && \
make clean && \
make -j4 && \
checkinstall --install=no

Does that help already? I have never worked with stack traces of running processes, but I can try if it is really required.
Bye

@MalavanEQAlpha
Contributor

Hi @Talkabout,
Thanks for the info, I was able to reproduce it. I believe this should be fixed with PR #391; can you try building from that branch (titled complete_fix_rdb_hang) for now?

@Talkabout
Author

Hi @MalavanEQAlpha,
Great that you were able to reproduce it. I have built this branch and am running that version now:
image
Will report tomorrow about results of that test.
Thanks!

@Talkabout
Author

Hi @MalavanEQAlpha,

Seems to be fixed :)

image

Will keep observing it over the next days, but I guess the problem is solved.

Thanks!

@Talkabout
Author

Hi @MalavanEQAlpha,

Still looks good:
image
Bye

@marcocapetta

Hi @MalavanEQAlpha,

We are having the same issue with KeyDB version 6.2.2 in an active-active replica.
The keydb-rdb-bgsave process is stuck, and in our case this is also preventing the AOF rewrite process from running, resulting in huge AOF files.
Disabling syslog as you suggested looks like it solves the issue.

I see you created a pull request about one month ago (PR #391), but it is still not merged. Is there any update on an official fix for the issue?

Thanks

@kitobelix

Hey everyone, I know this may be marked as fixed, but I stumbled on this bug when using the Docker image. I'm using 6.3.3, no replication, official image. Background saving works for a day or so, and after that I have to either kill the save process or restart the container to avoid ending up with a huge AOF.

I only realized this today because I thought it was happening due to a misconfigured installation. But after bashing my head against a wall, I came here and saw this might still be a live bug.
