large database problem #5
Why does your service crash? You should definitely tune the MySQL config so it doesn't allocate more memory than is available, and to prevent a major Docker malfunction, also set memory limits with Docker so that the OOM killer doesn't kill the Docker daemon. I've used this same image on Kontena with a 4GB compressed dump (over 30GB of data directory once imported into MySQL) and not had this issue, but my machines had 64GB of RAM. A persistent volume is definitely a must: in the case of a crash, database upgrade, reboot, or other event, you want the server to be able to use a small incremental snapshot to rejoin the cluster rather than a full snapshot each time, not to mention that if all nodes go down you don't want to lose your entire dataset.
Check out these options from the
There are other ones which pertain to memory limits and mounted volumes. I didn't use all options in the example because they will vary per user.
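To illustrate the kind of settings being referred to, here is a minimal compose sketch; the service name, file paths, and values below are my assumptions, not taken from the linked example:

```yaml
# Sketch only: cap MySQL's memory below the container limit and persist the data directory.
# Names and values are assumptions; tune them to your host.
services:
  mysql-node:
    volumes:
      - mysql-data:/var/lib/mysql                              # persistent volume so a restarted node can rejoin with an incremental transfer
      - ./low-memory.cnf:/etc/mysql/conf.d/low-memory.cnf:ro   # e.g. set innodb_buffer_pool_size well below the limit below
    deploy:
      resources:
        limits:
          memory: 4G                                           # hard cap so memory pressure is contained to this container, not dockerd
volumes:
  mysql-data:
```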
The problem is with the HC command: HEALTHCHECK CMD curl -f -o - http://localhost:8080/ || exit 1. The short interval causes Docker to start killing containers and stopping the process on the donor side, and in the end the whole cluster is down. I've tested a new configuration: HEALTHCHECK --interval=5m --timeout=3s CMD curl -f -o - http://localhost:8080/ || exit 1. I think the HEALTHCHECK CMD should also check whether the container is currently synchronizing with a donor.
I don't think that just bumping up the timeout is a good solution, as that is dependent on your dataset size: if it is larger, it will still be broken. Also, a 5 minute interval means you could be down for 5 whole minutes before it is noticed, and what do you do as your database grows? And what if it becomes a donor right toward the end of the 5 minutes? Note, there are actually two health checks running; one of them is considered up when a donor, so just switch the HC command to use the 8081 port number: https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/start.sh#L215
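A sketch of that change (the interval and timeout here are arbitrary; the key difference is the port):

```Dockerfile
# Use the 8081 endpoint, described above as staying up while the node is acting as a donor
HEALTHCHECK --interval=30s --timeout=10s CMD curl -f -o - http://localhost:8081/ || exit 1
```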
Hello, I have had the opportunity to examine the above problems with a 50GB database. I was expecting the healthcheck CMD to return something other than 503; additionally, I've enabled logs for the healthcheck to see what is going on. Conclusions:
root@171d3aa89054:/# cat /tmp/8081
Look at the logs
Maybe the healthcheck should be updated to report "healthy" between the time mysqld is exec'd and the time it joins successfully, but I'm not sure of the best way to do this. Perhaps the presence of a state file could help? The /var/lib/mysql/gvwstate.dat file could be useful; I suspect that this file is missing during the state transfer, so something like the following could be used:
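The exact snippet from this comment wasn't preserved in this copy of the thread, but a sketch of that idea might look like this, treating a missing gvwstate.dat as "still joining, don't fail the check":

```Dockerfile
# Sketch only: if gvwstate.dat is absent (assumed to mean the join/state transfer has not completed),
# report healthy; otherwise fall back to the normal HTTP check.
HEALTHCHECK CMD test ! -f /var/lib/mysql/gvwstate.dat || curl -f -o - http://localhost:8080/ || exit 1
```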
I think you'd also need to add [...]. Please give that a shot and let me know how it goes. Thanks!
Thanks for the reply.
Ahh, perfect! I forgot about that file! Do you care to submit a PR with the improved healthcheck? Thanks!
Yes, sure, let me do more tests and I will submit tomorrow. Thanks for the help!
Since sst_in_progress is deleted when the SST finishes, it could probably be simplified further:
Also, you might want to add an
An even more simplified solution could be HEALTHCHECK CMD test -f /var/lib/mysql/sst_in_progress || curl -f -o - http://localhost:8080/ || exit 1, or you don't have to change anything at all, because I can override the HEALTHCHECK CMD in docker-compose.yml.
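For reference, overriding the healthcheck from docker-compose.yml could look roughly like this; the service and image names are placeholders, not confirmed for this setup:

```yaml
# Sketch only: compose-level healthcheck override with the sst_in_progress guard
services:
  mysql-node:
    image: colinmollenhour/mariadb-galera-swarm   # image name assumed
    healthcheck:
      test: ["CMD-SHELL", "test -f /var/lib/mysql/sst_in_progress || curl -f -o - http://localhost:8080/ || exit 1"]
      interval: 30s
      timeout: 10s
```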
Ok, I've gone ahead and updated the default to use the sst_in_progress test since this is likely to be a recurring issue for other users as well. Thanks for your help, and I hope this container serves you well!
Hi Colin
How are you routing traffic? The various health checks have different purposes in different cases. In one case the scheduler needs to know if the instance should be killed; I think for the HEALTHCHECK CMD this needs to work as it does so that the node is not killed. In other cases the system needs to know if traffic can be routed; your routing system needs to be aware of the sync status and act accordingly. This is what the 8080 healthcheck is for, but how you make use of it depends on your system architecture. Currently I'm using HAProxy, but there are other ways to do it. Unfortunately, just using the Swarm routing is not sufficient unless Swarm adds a way to do separate health checks for "don't kill me" status vs "healthy".
That is the reason I am writing this :). I was wondering how you achieve that, because I saw no option to do it without additional software like HAProxy.
Looks like Docker 17.09 will add support for a "start_period", which could be a good workaround. In this case you would make the healthcheck only report healthy if synced and set the start_period to be sufficiently long to allow a sync to occur. See docker/cli#475. Regarding HAProxy, my config looks like this:
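The config itself didn't survive in this copy of the thread; a rough, hypothetical sketch of such a two-frontend HAProxy setup (hostnames, ports, and node count are my assumptions) might be:

```
# Hypothetical sketch only -- not the author's actual configuration.
# One frontend balances across all nodes; the other pins traffic to the first node.
listen galera-all
    bind *:3306
    mode tcp
    balance roundrobin
    option httpchk GET /
    server node1 mysql-node1:3306 check port 8080
    server node2 mysql-node2:3306 check port 8080
    server node3 mysql-node3:3306 check port 8080

listen galera-writer
    bind *:3307
    mode tcp
    option httpchk GET /
    server node1 mysql-node1:3306 check port 8080
    server node2 mysql-node2:3306 check port 8080 backup
    server node3 mysql-node3:3306 check port 8080 backup
```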
This provides two frontends: one for all nodes round-robin, and one that goes to just the first node for transactions that are not cluster-aware. The config has to be rendered by my "haproxy-service" container, which can be found here: https://github.com/colinmollenhour/haproxy-service
Colin
Why not use both? :) If your app is already cluster-friendly on the writes then you probably don't need HAProxy, but if you need all writes directed to the same node to ensure single-node level transactions then HAProxy is a pretty good solution. |
Hi
I have been doing a lot of tests for the past two weeks.
Everything works like a charm with a small database of at most 800MB.
The problem starts with a huge database of several GB.
My first approach was to do this in the way you suggested here:
#1
This is very interesting, but the problem is when your services crash. When they crash, Docker removes the volumes and starts creating new containers with a blank DB, and all logs and data disappear.
My second approach was to create mysql-seed with no dedicated volume and the nodes in global mode with a dedicated volume. I created the cluster and imported 3GB of data.
Then I added a new server to the cluster; Docker created a new container and started transferring data. After X seconds the healthcheck kills the container because there is no mysql process running:
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111 "Connection refused")
and this scenario repeats forever.
This behavior sometimes causes problems with other nodes (I don't know why) and the other nodes start restarting. If you don't have a dedicated volume, you lose your data.