This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

systemd[1]: Failed to send queued message: No buffer space available #1744

Closed
f0 opened this issue Jan 3, 2017 · 14 comments

Comments

@f0

f0 commented Jan 3, 2017

Today I found that one of our servers (bare metal) no longer gets new units via fleet.
After some debugging, I see that fleet does not sync its state with etcd.

Directly after restarting the fleet daemon, I see this error in the logs:

systemd[1]: Failed to send queued message: No buffer space available

Also, all systemd commands are "slow" or run into a timeout (e.g. systemctl status...).

I tried to restart the server via reboot. This hangs, and in the HW console I see a loop of umount commands. Only a hard reset from the HW console works.

Here is the journal from this

Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-de8aa5e9-5cfd-fcb1-3afe-842a5be52d98...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-e3553dfa-2aae-4ee6-da09-c0e72606268b...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-cf75216f-8b01-f27f-29a8-1c046c4e05a4...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-8921caa5-a35a-65af-7e71-256a3e9197e2...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-22ffda5e-23f3-0f4e-8666-07dafa44c0f1...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-058d730f-8e5b-f60f-7fdb-0618931f35ef...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-c658e1fd-67be-2638-ba5d-1ee881cb406e...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-119fe5f3-6510-0fc3-08d4-d702fa730472...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-a3ea93dc-229b-3944-1e8a-efbd415a8af6...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-d96efdc6-ea17-4bcf-d295-732c7eaaaf4b...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-85f33dbc-fe0d-75f5-8754-4a466376a922...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-1b819d33-caab-613f-f7b3-41f051fe8f72...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-0006c94e-bfa0-a22e-e905-de3f9c009a5a...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-f9567420-a997-af01-86cf-f3e474a9f6b2...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-e20cd214-4266-f94e-fbef-443c361b3334...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-645a3264-574b-50c2-aefe-3902d8b908bb...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-108137a1-53bc-dd95-9c72-066723a327b2...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-9594a622-03e4-5bd6-b948-5be55fd8eacd...
Jan 03 12:03:30 example.net systemd[1]: Unmounting /run/netns/cni-de401fb7-b30d-b492-01a0-8342f6702e92...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-278f47d7-8902-7f17-5550-d97556e40efa...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-610deff7-3066-122b-8ece-2c0e6140545a...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-12b47782-e686-169d-370c-14bd009a3103...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-6317f84d-2688-afcb-d88f-57d90c3062eb...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-ba86ad40-44a1-da0f-9446-7ae70fa4cdb4...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-823f9ac6-9746-9b8c-1cc4-7dfd8032208f...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-728fc04c-df0c-af95-89ee-500e65738c0a...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-83f1ef9e-6fa7-03ad-c6ac-5744a6e3351f...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-44e45496-16b0-31f1-685e-c98047a86ef1...
Jan 03 12:03:31 example.net systemd[1]: Unmounting /run/netns/cni-e5879c50-b8e8-8c0e-5df0-db0f81eeec6d...
Jan 03 12:03:37 example.net kernel: INFO: task kworker/17:45:115119 blocked for more than 120 seconds.
Jan 03 12:03:37 example.net kernel:       Not tainted 4.7.3-coreos-r2 #1
Jan 03 12:03:37 example.net kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 03 12:03:37 example.net kernel: kworker/17:45   D ffff881cbc4a7d18     0 115119      2 0x00000080
Jan 03 12:03:37 example.net kernel: Workqueue: events memcg_kmem_cache_create_func
Jan 03 12:03:37 example.net kernel:  ffff881cbc4a7d18 ffff88203e816cb0 ffff881d0fa9ba80 ffff881d0fa99d40
Jan 03 12:03:37 example.net kernel:  ffffffffad0ab296 ffff881cbc4a8000 ffffffffada6cac4 ffff881d0fa99d40

(this is actually truncated, there are 10x such messages in the log)
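For anyone triaging a similar journal, here is a minimal Python sketch that tallies the two message types visible in the excerpt above: the netns unmount attempts and the kernel hung-task reports. The regular expressions are derived only from the lines quoted in this issue, and the function name `summarize_journal` is made up for illustration; other systemd or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the journal excerpt quoted in this issue (assumption:
# other systemd/kernel versions may format these messages differently).
UNMOUNT_RE = re.compile(
    r"systemd\[1\]: Unmounting (/run/netns/cni-[0-9a-f-]+)\.\.\."
)
HUNG_RE = re.compile(
    r"kernel: INFO: task (\S+) blocked for more than (\d+) seconds"
)


def summarize_journal(lines):
    """Count netns unmount attempts per path and collect hung-task reports."""
    unmounts = Counter()
    hung_tasks = []
    for line in lines:
        m = UNMOUNT_RE.search(line)
        if m:
            unmounts[m.group(1)] += 1
            continue
        m = HUNG_RE.search(line)
        if m:
            # (task name, seconds blocked)
            hung_tasks.append((m.group(1), int(m.group(2))))
    return unmounts, hung_tasks
```

Fed the lines of a saved journal, `unmounts` shows whether the same namespace path is being retried and `hung_tasks` lists which kernel tasks were reported as blocked.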

After the reboot, the problem is gone.

CoreOS Version

NAME=CoreOS
ID=coreos
VERSION=1185.3.0
VERSION_ID=1185.3.0
BUILD_ID=2016-11-01-0605
PRETTY_NAME="CoreOS 1185.3.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
@f0
Author

f0 commented Jan 6, 2017

Maybe related: rkt/rkt#3486

@lucab

lucab commented Jan 6, 2017

It sounds similar but I don't think rkt is involved in that. How is fleet being run?

@f0
Author

f0 commented Jan 6, 2017

@lucab What do you mean? It's the default fleet from stable CoreOS with etcd2.

@crawford
Contributor

This might be related to #1742. Can you look at the available system memory?
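One way to check available memory is to read `/proc/meminfo`. Below is a minimal Python sketch, assuming a Linux kernel that exposes the `MemTotal` and `MemAvailable` fields; the helper names `parse_meminfo` and `available_fraction` are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain
    counts. The unit suffix is dropped either way.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On the affected host, `available_fraction(read_meminfo())` gives a quick number to paste into the report.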

@f0
Author

f0 commented Jan 12, 2017

@crawford The system had 128 GB of memory, and not all of it was used.

@crawford
Contributor

Heh, yeah probably didn't chew through all of that.

@f0
Author

f0 commented Jan 13, 2017

and fleet does not consume much

systemctl status fleet
● fleet.service - fleet daemon
   Loaded: loaded (/usr/lib/systemd/system/fleet.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-01-03 12:12:09 CET; 1 weeks 2 days ago
 Main PID: 10263 (fleetd)
    Tasks: 8
   Memory: 41.9M
      CPU: 14h 45min 47.947s
   CGroup: /system.slice/fleet.service
           └─10263 /usr/bin/fleetd

@f0
Author

f0 commented Mar 11, 2017

@crawford This happened again on another node, with exactly the same behaviour.

@crawford
Contributor

@f0 We still have not been able to reproduce this failure. The upcoming Alpha should have systemd 233, so it will be interesting to see if you still run into this.

@crawford
Contributor

@f0 Are you still seeing this issue?

@crawford
Contributor

Closing due to inactivity.

@cdosborn

I came across this error (not on CoreOS). It appears to be a bug in systemd, and the following command resolved it for me:

systemctl daemon-reexec
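Before reaching for a re-exec, it may help to confirm that the manager really is unresponsive, as in the original report where systemctl calls ran into timeouts. The following is a minimal Python sketch; the helper name `completes_within` and the 10-second limit in the usage note are assumptions, not anything systemd prescribes.

```python
import subprocess


def completes_within(cmd, timeout_s):
    """Return True if cmd finishes (with any exit status) within timeout_s.

    Returns False if the command times out or cannot be started.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False
    except FileNotFoundError:
        return False
    return True
```

For example, `completes_within(["systemctl", "show", "--property=Version"], 10)` queries a single manager property; if that call times out, PID 1 is probably not answering requests.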

@isality

isality commented Jul 27, 2018

the problem is still relevant!

# apt-cache policy systemd
systemd:
  Installed: 229-4ubuntu21.2
  Candidate: 229-4ubuntu21.2

@lucab

lucab commented Jul 27, 2018

The underlying issue here is the same as upstream bug systemd/systemd#4068. Synchronous D-Bus operations require some buffering on the systemd side, and under high workload those buffers may saturate. Buffer sizes were bumped in v232, which should alleviate this issue in most cases, but buffers can still fill up in extreme situations.
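To make the mechanism concrete, here is a toy Python model of a fixed-capacity outgoing message queue. It is not systemd's actual code: the class name `BoundedSendQueue` and the capacity are invented for illustration. It only shows how a sender that outpaces its reader ends up with `ENOBUFS`, the errno behind "No buffer space available".

```python
import errno
import os
from collections import deque


class BoundedSendQueue:
    """Toy model of a fixed-capacity outgoing message queue."""

    def __init__(self, max_messages):
        self.max_messages = max_messages
        self._pending = deque()

    def enqueue(self, message):
        # If the peer is not draining fast enough, the queue fills up and
        # further sends are rejected with ENOBUFS.
        if len(self._pending) >= self.max_messages:
            raise OSError(errno.ENOBUFS, os.strerror(errno.ENOBUFS))
        self._pending.append(message)

    def drain(self, count):
        """Simulate the peer reading up to count messages; return how many."""
        n = min(count, len(self._pending))
        for _ in range(n):
            self._pending.popleft()
        return n
```

Raising the capacity, as the buffer-size bump did, only moves the point at which the queue saturates; a sufficiently bursty workload can still fill it.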

As this is starting to attract unrelated non-coreos followups, I'm going to lock this conversation. Further specific bugs can be discussed in new dedicated tickets.

@coreos coreos locked as spam and limited conversation to collaborators Jul 27, 2018