docker cluster errors #119

Closed
everyonce opened this issue Mar 23, 2018 · 12 comments

@everyonce

everyonce commented Mar 23, 2018

Update - the newest errors are just about getting the cluster spun up without DNS errors, etc.


So, in previous issues, I got your Dockerfile code, and now I'm trying to add onto that Dockerfile to do the actual BUILD in Docker as well. I've got that working, but with my completed image I now get an error when I run yb-docker-ctl create:

`E0322 14:04:59.762842 32 master.cc:213] [email protected]:7100: Unable to init master catalog manager: Network error (yb/util/net/net_util.cc:195): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to initialize distributed config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "yb-master-n2" port: 7100 }: Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known
Network error (yb/util/net/net_util.cc:195): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to initialize distributed config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "yb-master-n2" port: 7100 }: Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known
F0322 14:04:59.763340 1 master.cc:137] Check failed: kRunning != state_ (2 vs. 2)
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2018-03-22T14_04_59.pid1.txt
F20180322 14:04:59 /home/yugabyte/code/src/yb/master/master.cc:137] Check failed: kRunning != state_ (2 vs. 2)
@ 0x7f8bec017e01 yb::LogFatalHandlerSink::send(int, char const*, char const*, int, tm const*, char const*, unsigned long) (yb/util/logging.cc:407)
@ 0x7f8beb1d5b75
@ 0x7f8beb1d33a9
@ 0x7f8beb1d621e
@ 0x7f8bf1ea5d95 yb::master::Master::~Master() (yb/master/master.cc:137)
@ 0x406665 MasterMain (yb/master/master_main.cc:82)
@ 0x7f8be7d27824 __libc_start_main (../csu/libc-start.c:289)
@ 0x4062c8 (unknown) (../sysdeps/x86_64/start.S:118)
@ 0xffffffffffffffff

*** Check failure stack trace: ***
@ 0x7f8bec01660b DumpStackTraceAndExit (yb/util/logging.cc:156)
@ 0x7f8beb1d384c
@ 0x7f8beb1d575c
@ 0x7f8beb1d33a9
@ 0x7f8beb1d621e
@ 0x7f8bf1ea5d95 yb::master::Master::~Master() (yb/master/master.cc:137)
@ 0x406665 MasterMain (yb/master/master_main.cc:82)
@ 0x7f8be7d27824 __libc_start_main (../csu/libc-start.c:289)
@ 0x4062c8 (unknown) (../sysdeps/x86_64/start.S:118)
@ 0xffffffffffffffff
*** Aborted at 1521727500 (unix time) try "date -d @1521727500" if you are using GNU date ***
PC: @ 0x7f8be7d3b536 __GI_abort
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7f8bf234da40) from PID 0; stack trace: ***
@ 0x7f8be80b3ba0 (unknown)
@ 0x7f8be7d3b536 __GI_abort
@ 0x7f8bec01665d yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7f8beb1d384d google::LogMessage::Fail()
@ 0x7f8beb1d575d google::LogMessage::SendToLog()
@ 0x7f8beb1d33aa google::LogMessage::Flush()
@ 0x7f8beb1d621f google::LogMessageFatal::~LogMessageFatal()
@ 0x7f8bf1ea5d96 yb::master::Master::~Master()
@ 0x406666 yb::master::MasterMain()
@ 0x7f8be7d27825 __libc_start_main
@ 0x4062c9 _start
@ 0x0 (unknown)
`

@rkarthik007
Collaborator

Hi @everyonce,

Guessing the following lines are from the logs of yb-master-n1:

VOTER last_known_addr { host: "yb-master-n2" port: 7100 }: Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known

Looks like that container is not able to talk to yb-master-n2? A couple of things to look at:

  • Could you please install telnet and try to connect to yb-master-n2 on port 7100 from yb-master-n1? This is just to verify whether port 7100 is reachable from the other containers.
  • Docker needs a bridge network in order for the containers to talk to each other. The yb-docker-ctl script creates a network, yb-net, and launches the containers into that network. Could you please verify that?
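
The first check can also be done without installing telnet. Below is a minimal Python sketch, assuming Python is available inside the yb-master-n1 container, that separates a name-resolution failure (the EAI_NONAME case in the log above) from a port that resolves but cannot be reached. The function name `check_peer` is hypothetical and is not part of yb-docker-ctl; the host and port in the usage note come from the log.

```python
import socket


def check_peer(host, port, timeout=3.0):
    """Return 'dns_failure', 'connect_failure', or 'ok' for host:port."""
    try:
        # Same kind of lookup that fails with EAI_NONAME in the master log.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue  # try the next resolved address, if any
    return "connect_failure"
```

Calling `check_peer("yb-master-n2", 7100)` from yb-master-n1 would then point at either the Docker network/DNS setup (`dns_failure`) or the port itself (`connect_failure`).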

@rkarthik007 rkarthik007 self-assigned this Mar 23, 2018
@everyonce
Author

ah, ok.
master 1 fails because it can't find master 2.
master 2 fails because it can't find master 3.
master 3 fails with this message:
E0323 03:40:32.991353 32 master.cc:213] [email protected]:7100: Unable to init master catalog manager: IO error (yb/util/env_posix.cc:202): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to open new log: Call to mkstemp() failed on name template /mnt/disk1/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/.tmp.newsegmentXXXXXX: Invalid argument (error 22) IO error (yb/util/env_posix.cc:202): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to open new log: Call to mkstemp() failed on name template /mnt/disk1/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/.tmp.newsegmentXXXXXX: Invalid argument (error 22)

I'm curious about the message, but also about the cascading failure. Can a master not start up without a second master? Isn't that a bit fragile? Maybe I'm not understanding the way these work together.

Thanks!

@rkarthik007
Collaborator

Ah, OK, it makes sense that this error is causing the one you noticed.

In a "brand new universe creation" code path, we wait (for a configurable interval of time) for all 3 masters to join. This is currently by design, to avoid scenarios where, due to some error, the universe did not get configured as expected but the error goes unnoticed (like this scenario). This especially comes in handy when we are provisioning as part of an automated workflow (like a Kubernetes YAML, Terraform, etc).

Once the universe is created, it can tolerate a master failure. All of the masters need not be up subsequently for operations to proceed.
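
The arithmetic behind that tolerance can be sketched as follows, assuming the masters form a standard majority-quorum group (which the VOTER entries in the log above suggest). The function names are illustrative only and are not YugaByte APIs.

```python
def majority(voters):
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can be down while a majority remains."""
    return voters - majority(voters)


for n in (1, 3, 5):
    print(n, majority(n), tolerated_failures(n))
# prints:
# 1 1 0
# 3 2 1
# 5 3 2
```

Under that assumption, a 3-master universe keeps operating with one master down, while initial creation is stricter by design and waits for all 3.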

@everyonce
Author

Ok, so I've determined that the above error is actually caused by an O_DIRECT + ZFS + Docker problem. I'm fixing that on my side for now.
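
A quick way to probe for this kind of problem is to try opening a scratch file with O_DIRECT in the data directory. This is a minimal Python sketch, assuming Linux and assuming the failure shows up as EINVAL (error 22, as in the log above); `supports_o_direct` is a hypothetical helper, not something shipped with YugaByte.

```python
import errno
import os
import tempfile


def supports_o_direct(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        return False  # this platform does not define O_DIRECT
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        fd = os.open(path, os.O_RDWR | flag)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return False  # filesystem rejects O_DIRECT
        raise
    else:
        os.close(fd)
        return True
    finally:
        os.unlink(path)  # always clean up the scratch file
```

Running `supports_o_direct("/mnt/disk1")` inside the container (the mount point seen in the log) would show whether the volume backing the data directory accepts O_DIRECT.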

I'm still having problems getting the entire Docker cluster to come up at once, and I'm wondering how to change that timeout. It appears that master 1 gives up before master 2 is even online. All my errors now are DNS errors from one node to another in the Docker containers created by yb-docker-ctl.

@rkarthik007
Collaborator

cc @bmatican - could you please add the flag we need to increase the time the masters wait to find quorum on initial startup?

@everyonce
Author

I'm also concerned that it might be failing immediately if DNS resolution fails there - the timeout should cover the DNS lookup as well, instead of just the connect itself.
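
To illustrate the behavior being asked for, here is a minimal Python sketch in which name-resolution failures are retried until an overall deadline expires, rather than failing on the first lookup. It is not how the master is implemented; `resolve_with_deadline` and its parameters are hypothetical, and the `resolver` argument exists only so the lookup can be swapped out in tests.

```python
import socket
import time


def resolve_with_deadline(host, port, timeout_s, retry_interval_s=1.0,
                          resolver=socket.getaddrinfo):
    """Retry name resolution until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return resolver(host, port)
        except socket.gaierror:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise  # deadline passed: surface the last DNS error
            time.sleep(min(retry_interval_s, remaining))
```

With this shape, a peer such as yb-master-n2 that is not yet registered in Docker's DNS when the first master starts would be retried until it appears or the deadline passes.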

@rkarthik007
Collaborator

Agreed @everyonce... also @bmatican - I somehow thought we had increased this timeout to a very large value (many minutes). It would be good to check our defaults and our behavior on DNS resolution failure, as @everyonce pointed out.

@bmatican
Contributor

@everyonce based on your error message, yeah, O_DIRECT is a good guess. We are working on #18 to gracefully degrade from using O_DIRECT.

As for the timeout part, we have this master flag master_discovery_timeout_ms, but it defaults to 1h, so I don't think that's the problem.

From what I'm seeing in your first set of logs, Unable to resolve address yb-master-n2, getaddrinfo returned -2 (EAI_NONAME): Name or service not known is the most indicative of the problem. We're trying to do a DNS resolution (src/yb/consensus/consensus_peers.cc:457) in order to create an RPC proxy to the peer and that seems to fail in your Docker environment.

Maybe @ramkumarvs has some more input on this, since we definitely got around this in yb-docker-ctl.

@everyonce
Author

Well - I am using yb-docker-ctl to do this, and I'm still getting that error.
I'm wondering whether the first master is spun up and attempts the DNS resolution before yb-docker-ctl has even had time to create the second one.
Thanks for the help, and let me know if there's anything else I can try to help debug it.

@ramkumarvs
Contributor

@everyonce can you push your Docker image to Docker Hub or Quay? Or share your Dockerfile so we can debug this internally?

We will also try to make a Dockerfile that builds from source available shortly.

@everyonce
Author

Yup. I've got two Dockerfiles. They are the same, except one has a workaround built in for the O_DIRECT stuff.
BOTH are multi-stage, with the first stage doing the build - this may be involved in the problem, but it does compile correctly within the Docker container.

https://github.com/everyonce/yugabyte-db

@rkarthik007 rkarthik007 added the kind/question This is a question label Apr 6, 2018
@rkarthik007
Collaborator

@everyonce - forgot to update this issue. Here are the supporting issues for this one:

I am closing this issue out (since it seems like it will be resolved by the above), please reopen if there is something I have missed. Thanks!

jasonyb pushed a commit that referenced this issue Jun 11, 2024
PG-259: Fixed PGSM build in Test-with-pg13-pgdg-packages GH action.