docker cluster errors #119
Comments
Hi @everyonce, Guessing the following lines are from the logs of
Looks like that container is not able to talk to
|
Ah, OK. I'm curious about the message itself, but also about the cascading failure. Can the master not start up without a second master? Isn't that a bit fragile? Maybe I'm not understanding how these work together. Thanks! |
Ah, OK, that error causing the one you noticed makes sense. In the "brand new universe creation" code path, we wait for a configurable interval for all 3 masters to join. This is currently by design, to avoid scenarios where the universe did not get configured as expected because of some error, but the error goes unnoticed (like this scenario). This comes in especially handy when we are provisioning as part of an automated workflow (a Kubernetes YAML, Terraform, etc.). Once the universe is created, it can tolerate a master failure: not all of the masters need to be up for subsequent operations to proceed. |
OK, so I've determined that the above error is actually caused by an O_DIRECT + ZFS + Docker problem. I'm fixing that on my side for now. I'm still having problems getting the entire Docker cluster to come up at once, and I'm wondering how to change that timeout. It appears that master 1 gives up before master 2 is even online. All my errors now are DNS errors from one node to another in the Docker containers created by yb-docker-ctl. |
cc @bmatican - could you please add the flag we need to increase the time the masters wait to find quorum on initial startup? |
I'm also concerned that it might be failing immediately if DNS resolution fails there. The timeout should cover the DNS lookup as well, not just the connect itself. |
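To make the request above concrete, here is a minimal Python sketch of a startup wait whose time budget covers name resolution as well as the connect. It is not YugaByte's implementation; `resolve_with_deadline`, `connect_with_deadline`, and their parameters are hypothetical names used only for illustration, and the default timeout values are assumptions.

```python
import socket
import time


def resolve_with_deadline(host, port, timeout_s=30.0, retry_interval_s=0.5,
                          resolver=socket.getaddrinfo):
    """Retry name resolution until it succeeds or the deadline passes.

    A peer container that has not started yet typically fails with
    EAI_NONAME, so that failure is retried instead of treated as fatal.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise  # budget exhausted: surface the last DNS error
            time.sleep(min(retry_interval_s, remaining))


def connect_with_deadline(host, port, timeout_s=30.0):
    """Resolve and connect under one shared time budget."""
    deadline = time.monotonic() + timeout_s
    infos = resolve_with_deadline(host, port, timeout_s)
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("no time left to connect to %s:%d" % (host, port))
    family, socktype, proto, _, addr = infos[0]
    sock = socket.socket(family, socktype, proto)
    try:
        sock.settimeout(remaining)  # connect only gets what DNS left over
        sock.connect(addr)
    except OSError:
        sock.close()
        raise
    return sock
```

With this shape, a master that starts before its peers keeps retrying the lookup of a name such as `yb-master-n2` until the overall deadline, rather than aborting on the first `getaddrinfo` failure.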
Agreed @everyonce. Also, @bmatican, I somehow thought we had increased this timeout to a very large value (many minutes). It would be good to check our defaults and our behavior on DNS resolution failure, as @everyonce pointed out. |
@everyonce based on your error message, yeah, O_DIRECT is a good guess. We are working on #18 to gracefully degrade from using O_DIRECT. As for the timeout part, we have this master flag From what I'm seeing in your first set of logs, Maybe @ramkumarvs has some more input on this, since we definitely got around this in |
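As a rough illustration of what "gracefully degrade from using O_DIRECT" can mean, the Python sketch below tries to open a file with `O_DIRECT` and falls back to buffered I/O when the filesystem rejects the flag. It is only a sketch under assumptions: `open_with_direct_fallback` is a hypothetical helper, not code from #18, and it assumes the filesystem signals lack of support with `EINVAL` at open time.

```python
import errno
import os


def open_with_direct_fallback(path):
    """Open `path` read/write, preferring O_DIRECT when it is usable.

    Returns (fd, used_direct). Some filesystems reject O_DIRECT at open
    time with EINVAL; in that case the file is reopened without the flag.
    """
    flags = os.O_RDWR | os.O_CREAT
    direct = getattr(os, "O_DIRECT", 0)  # the flag is absent on some platforms
    if direct:
        try:
            return os.open(path, flags | direct, 0o644), True
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # unrelated failure: do not mask it
    # Fallback: ordinary buffered I/O through the page cache.
    return os.open(path, flags, 0o644), False
```

The caller can log `used_direct` once at startup, so a silent fallback does not hide a performance difference.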
Well - I am using yb-docker-ctl to do this, and getting that error |
@everyonce can you push your Docker image to Docker Hub or Quay, or share your Dockerfile so we can debug this internally? We will also try to make the Dockerfile that builds from source available shortly. |
Yup. I've got two dockerfiles. They are both the same, except one has a workaround built in for the O_DIRECT stuff. |
@everyonce - forgot to update this issue. Here are the supporting issues for this one:
I am closing this issue out, since it seems like it will be resolved by the above. Please reopen if there is something I have missed. Thanks! |
Update: the newest errors are just about getting the cluster spun up without DNS errors, etc.
Following previous issues, I got your Dockerfile code, and now I'm trying to extend that Dockerfile to do the actual build in Docker as well. I've got that working, but with my completed image I now get an error when I run `yb-docker-ctl create`
:`E0322 14:04:59.762842 32 master.cc:213] [email protected]:7100: Unable to init master catalog manager: Network error (yb/util/net/net_util.cc:195): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to initialize distributed config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "yb-master-n2" port: 7100 }: Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known
Network error (yb/util/net/net_util.cc:195): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to initialize distributed config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "yb-master-n2" port: 7100 }: Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known
F0322 14:04:59.763340 1 master.cc:137] Check failed: kRunning != state_ (2 vs. 2)
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2018-03-22T14_04_59.pid1.txt
F20180322 14:04:59 /home/yugabyte/code/src/yb/master/master.cc:137] Check failed: kRunning != state_ (2 vs. 2)
@ 0x7f8bec017e01 yb::LogFatalHandlerSink::send(int, char const*, char const*, int, tm const*, char const*, unsigned long) (yb/util/logging.cc:407)
@ 0x7f8beb1d5b75
@ 0x7f8beb1d33a9
@ 0x7f8beb1d621e
@ 0x7f8bf1ea5d95 yb::master::Master::~Master() (yb/master/master.cc:137)
@ 0x406665 MasterMain (yb/master/master_main.cc:82)
@ 0x7f8be7d27824 __libc_start_main (../csu/libc-start.c:289)
@ 0x4062c8 (unknown) (../sysdeps/x86_64/start.S:118)
@ 0xffffffffffffffff
*** Check failure stack trace: ***
@ 0x7f8bec01660b DumpStackTraceAndExit (yb/util/logging.cc:156)
@ 0x7f8beb1d384c
@ 0x7f8beb1d575c
@ 0x7f8beb1d33a9
@ 0x7f8beb1d621e
@ 0x7f8bf1ea5d95 yb::master::Master::~Master() (yb/master/master.cc:137)
@ 0x406665 MasterMain (yb/master/master_main.cc:82)
@ 0x7f8be7d27824 __libc_start_main (../csu/libc-start.c:289)
@ 0x4062c8 (unknown) (../sysdeps/x86_64/start.S:118)
@ 0xffffffffffffffff
*** Aborted at 1521727500 (unix time) try "date -d @1521727500" if you are using GNU date ***
PC: @ 0x7f8be7d3b536 __GI_abort
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7f8bf234da40) from PID 0; stack trace: ***
@ 0x7f8be80b3ba0 (unknown)
@ 0x7f8be7d3b536 __GI_abort
@ 0x7f8bec01665d yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7f8beb1d384d google::LogMessage::Fail()
@ 0x7f8beb1d575d google::LogMessage::SendToLog()
@ 0x7f8beb1d33aa google::LogMessage::Flush()
@ 0x7f8beb1d621f google::LogMessageFatal::~LogMessageFatal()
@ 0x7f8bf1ea5d96 yb::master::Master::~Master()
@ 0x406666 yb::master::MasterMain()
@ 0x7f8be7d27825 __libc_start_main
@ 0x4062c9 _start
@ 0x0 (unknown)
`