Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control process not returned to systemd #7137

Closed
PMacho opened this issue Jan 25, 2020 · 8 comments
Closed

Control process not returned to systemd #7137

PMacho opened this issue Jan 25, 2020 · 8 comments
Assignees
Labels
type/docs Documentation needs to be created/updated/clarified

Comments

@PMacho
Copy link

PMacho commented Jan 25, 2020

Overview of the Issue

I just tried the nice documentation on https://learn.hashicorp.com/consul/datacenter-deploy/deployment-guide. However, systemd can start consul and everything locks fine but the startup process never returns and thus finally systemd runs in a timeout.

Reproduction Steps

Exactly like shown on https://learn.hashicorp.com/consul/datacenter-deploy/deployment-guide
This is for test purposes so I just configured bootstrap_expect = 1, no retry_join, performance or metrics.

Consul info for both Client and Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = 1200f25e
	version = 1.6.2
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.2.0.2:8300
	server = true
raft:
	applied_index = 186
	commit_index = 186
	fsm_pending = 0
	last_contact = 0
	last_log_index = 186
	last_log_term = 9
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:f516c909-0079-07ba-cb2a-f2cd3d5a405b Address:10.2.0.2:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 9
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 80
	max_procs = 1
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 1
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

/usr/local/bin/consul --version
Consul v1.6.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

vim /etc/systemd/system/consul.service
[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/consul.d/consul.hcl
[Service]
Type=notify
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/usr/local/bin/consul reload
KillMode=process
Restart=on-failure
LimitNOFILE=65536
TimeoutSec=900
[Install]
WantedBy=multi-user.target

Log Fragments

Jan 25 19:49:15 test systemd[1]: Starting "HashiCorp Consul - A service mesh solution"...
Jan 25 19:49:15 test consul[10967]: BootstrapExpect is set to 1; this is the same as Bootstrap mode.
Jan 25 19:49:15 test consul[10967]: WARNING: LAN keyring exists but -encrypt given, using keyring
Jan 25 19:49:15 test consul[10967]: WARNING: WAN keyring exists but -encrypt given, using keyring
Jan 25 19:49:15 test consul[10967]: bootstrap = true: do not enable unless necessary
Jan 25 19:49:15 test consul[10967]: ==> Starting Consul agent...
Jan 25 19:49:15 test consul[10967]: Version: 'v1.6.2'
Jan 25 19:49:15 test consul[10967]: Node ID: 'f516c909-0079-07ba-cb2a-f2cd3d5a405b'
Jan 25 19:49:15 test consul[10967]: Node name: 'test'
Jan 25 19:49:15 test consul[10967]: Datacenter: 'test' (Segment: '')
Jan 25 19:49:15 test consul[10967]: Server: true (Bootstrap: true)
Jan 25 19:49:15 test consul[10967]: Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Jan 25 19:49:15 test consul[10967]: Cluster Addr: 10.2.0.2 (LAN: 8301, WAN: 8302)
Jan 25 19:49:15 test consul[10967]: Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
Jan 25 19:49:15 test consul[10967]: ==> Log data will now stream in as it occurs:
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:f516c909-0079-07ba-cb2a-f2cd3d5a405b Address:10.2.0.2:8300}]
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] serf: EventMemberJoin: test.test 10.2.0.2
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] serf: EventMemberJoin: test 10.2.0.2
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] raft: Node at 10.2.0.2:8300 [Follower] entering Follower state (Leader: "")
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [WARN] serf: Failed to re-join any previously known node
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [WARN] serf: Failed to re-join any previously known node
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] consul: Adding LAN server test (Addr: tcp/10.2.0.2:8300) (DC: test)
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] consul: Handled member-join event for server "test.test" in area "wan"
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
Jan 25 19:49:15 test consul[10967]: 2020/01/25 19:49:15 [INFO] agent: started state syncer
Jan 25 19:49:15 test consul[10967]: ==> Consul agent running!
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [WARN] raft: Heartbeat timeout from "" reached, starting election
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [INFO] raft: Node at 10.2.0.2:8300 [Candidate] entering Candidate state in term 7
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [INFO] raft: Election won. Tally: 1
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [INFO] raft: Node at 10.2.0.2:8300 [Leader] entering Leader state
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [INFO] consul: cluster leadership acquired
Jan 25 19:49:20 test consul[10967]: 2020/01/25 19:49:20 [INFO] consul: New leader elected: test
Jan 25 19:49:21 test consul[10967]: 2020/01/25 19:49:21 [INFO] agent: Synced node info
Jan 25 20:04:15 test systemd[1]: consul.service: Start operation timed out. Terminating.
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Caught signal: terminated
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Graceful shutdown disabled. Exiting
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Requesting shutdown
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] consul: shutting down server
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [WARN] serf: Shutdown without a Leave
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [WARN] serf: Shutdown without a Leave
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] manager: shutting down
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: consul server down
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: shutdown complete
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (tcp)
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (udp)
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Stopping HTTP server 127.0.0.1:8500 (tcp)
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Waiting for endpoints to shut down
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Endpoints down
Jan 25 20:04:15 test consul[10967]: 2020/01/25 20:04:15 [INFO] agent: Exit code: 1
Jan 25 20:04:15 test systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 20:04:15 test systemd[1]: consul.service: Failed with result 'timeout'.
Jan 25 20:04:15 test systemd[1]: Failed to start "HashiCorp Consul - A service mesh solution".
Jan 25 20:04:15 test systemd[1]: consul.service: Service hold-off time over, scheduling restart.
Jan 25 20:04:15 test systemd[1]: consul.service: Scheduled restart job, restart counter is at 1.
Jan 25 20:04:15 test systemd[1]: Stopped "HashiCorp Consul - A service mesh solution".

@iotdog
Copy link

iotdog commented Feb 19, 2020

I have the same issue, did you resolve it?

@iotdog
Copy link

iotdog commented Feb 22, 2020

I figure it out, as in the service definition file, there is an option Type under [service] seciton:

[Service]
Type=notify

notify means the service should send a notification message to systemd when it has finished starting up, otherwise systemd will not proceed, see systemd document for details.

for consul, it just send the notification after a LAN join has completed, see consul agent:

When running under systemd on Linux, Consul notifies systemd by sending READY=1 to the $NOTIFY_SOCKET when a LAN join has completed. For this either the join or retry_join option has to be set and the service definition file has to have Type=notify set.

but our configuration is to set up a single node agent (bootstrap_expect = 1, no retry_join), so LAN join will not complete ever, this is why systemd get timeout.

we can solve this by set Type option to "exec", systemd will proceed after consul process started.

but I have other problem, when I run systemctl stop consul, the log always show that consul exits with error:

Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: consul server down
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: shutdown complete
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=tcp
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=udp
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=HTTP address=[::]:8500 network=tcp
Feb 22 21:01:53 pyddot-ecs consul[6602]: 2020-02-22T21:01:53.238+0800 [WARN] agent: Timeout stopping server: protocol=HTTP address=[::]:8500 network=tcp
Feb 22 21:01:53 pyddot-ecs systemd[1]: consul.service: main process exited, code=exited, status=1/FAILURE
Feb 22 21:01:53 pyddot-ecs systemd[1]: Stopped "HashiCorp Consul - A service mesh solution".
Feb 22 21:01:53 pyddot-ecs systemd[1]: Unit consul.service entered failed state.
Feb 22 21:01:53 pyddot-ecs systemd[1]: consul.service failed.

it seems that consul does not gracefully stop. still don't know why...

@danielehc danielehc added the type/docs Documentation needs to be created/updated/clarified label Apr 17, 2020
@danielehc
Copy link
Contributor

Hi @iotdog and @PMacho , thanks for reporting this.

I created an internal issue for our learn site to investigate on this and will comment on this issue as soon as we have updates.

@PMacho
Copy link
Author

PMacho commented Apr 17, 2020

I figure it out, as in the service definition file, there is an option Type under [service] seciton:

[Service]
Type=notify

notify means the service should send a notification message to systemd when it has finished starting up, otherwise systemd will not proceed, see systemd document for details.

for consul, it just send the notification after a LAN join has completed, see consul agent:

When running under systemd on Linux, Consul notifies systemd by sending READY=1 to the $NOTIFY_SOCKET when a LAN join has completed. For this either the join or retry_join option has to be set and the service definition file has to have Type=notify set.

but our configuration is to set up a single node agent (bootstrap_expect = 1, no retry_join), so LAN join will not complete ever, this is why systemd get timeout.

we can solve this by set Type option to "exec", systemd will proceed after consul process started.

but I have other problem, when I run systemctl stop consul, the log always show that consul exits with error:

Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: consul server down
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: shutdown complete
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=tcp
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=udp
Feb 22 21:01:52 pyddot-ecs consul[6602]: 2020-02-22T21:01:52.238+0800 [INFO] agent: Stopping server: protocol=HTTP address=[::]:8500 network=tcp
Feb 22 21:01:53 pyddot-ecs consul[6602]: 2020-02-22T21:01:53.238+0800 [WARN] agent: Timeout stopping server: protocol=HTTP address=[::]:8500 network=tcp
Feb 22 21:01:53 pyddot-ecs systemd[1]: consul.service: main process exited, code=exited, status=1/FAILURE
Feb 22 21:01:53 pyddot-ecs systemd[1]: Stopped "HashiCorp Consul - A service mesh solution".
Feb 22 21:01:53 pyddot-ecs systemd[1]: Unit consul.service entered failed state.
Feb 22 21:01:53 pyddot-ecs systemd[1]: consul.service failed.

it seems that consul does not gracefully stop. still don't know why...

Indeed solved the problem. Thank you. And yes I also see this error when stopping consul.

@rickyma
Copy link

rickyma commented Apr 21, 2020

I found a workaround solution. Add these two configurations in your Consul Server config file:

bootstrap_expect = 1
retry_join = ["127.0.0.1"]

This can work fine with type=notify.

@Anricx
Copy link

Anricx commented Apr 26, 2020

same issue, it make me

Hi @iotdog and @PMacho , thanks for reporting this.

I created an internal issue for our learn site to investigate on this and will comment on this issue as soon as we have updates.

how's it going?

@danielehc danielehc self-assigned this May 18, 2020
@danielehc
Copy link
Contributor

@Anricx , @PMacho , @iotdog the issue should now be fixed in the learn platform and the new version is live at https://learn.hashicorp.com/consul/datacenter-deploy/deployment-guide#configure-systemd.

We are closing this issue. In case you still get issues with the systemd configuration feel free to re-open this ticket.

@axsuul
Copy link

axsuul commented Jun 19, 2021

I think Type=exec is a better default since it's better if the Consul agent process remains running even if it hasn't joined with others.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

No branches or pull requests

6 participants