Secrets on PostgreSQL plugin require increased limit on locked memory #12980

Closed
PierreF opened this issue Mar 29, 2023 · 4 comments · Fixed by #12990
Labels
bug unexpected problem or unintended behavior plugin/secretstores

Comments

@PierreF
Contributor

PierreF commented Mar 29, 2023

Relevant telegraf.conf

[[inputs.postgresql]]
  address = "host=172.17.0.2 port=5432 user=postgres password=abc dbname=postgres sslmode=disable"
[[inputs.postgresql]]
  address = "host=172.17.0.3 port=5432 user=postgres password=abc dbname=postgres sslmode=disable"
[[inputs.postgresql]]
  address = "host=172.17.0.4 port=5432 user=postgres password=abc dbname=postgres sslmode=disable"
[[inputs.postgresql]]
[...]
[[inputs.postgresql]]
  address = "host=172.17.0.17 port=5432 user=postgres password=abc dbname=postgres sslmode=disable"

Logs from Telegraf

2023-03-29T12:17:39Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 10s
2023-03-29T12:17:39Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 10s
2023-03-29T12:17:39Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 10s
2023-03-29T12:17:39Z W! [inputs.postgresql] Collection took longer than expected; not complete after interval of 10s

System info

Telegraf 1.26.0 Ubuntu 22.04

Docker

No response

Steps to reproduce

  1. Start multiple (at least 7) PostgreSQL containers:
for i in $(seq 1 10); do docker run -d --name postgresql$i -e POSTGRES_PASSWORD=abc postgres:14; done
  2. Configure Telegraf to gather from all of those PostgreSQL instances (see the telegraf.conf excerpt above).
  3. Start Telegraf as a user whose locked-memory ulimit matches what systemd would apply (64 KB):
sudo -u telegraf -i
ulimit -Hl 64; ulimit -l 64  # Needed because sudo raises the limit. A service started by systemd gets 64 KB by default, at least on Ubuntu 22.04.

./usr/bin/telegraf --config ./etc/telegraf/telegraf.conf --config-directory ./etc/telegraf/telegraf.d/
  4. Wait a bit; the problem should appear quickly.
  5. "Collection took longer than expected" warnings start to appear and all PostgreSQL instances stop being monitored.
  6. Telegraf will no longer stop cleanly.

Expected behavior

PostgreSQL metrics still being sent.

Actual behavior

The PostgreSQL instances stop being monitored, and parts of Telegraf (anything that depends on the blocked PostgreSQL input) hang.

Additional info

The hang itself is due to awnumar/memguard#147

The hang is triggered because each Gather() call of the PostgreSQL input requires new locked memory pages:

  • At rest, Telegraf requires 3 pages of locked memory (you can check the number of locked pages with cat /proc/$PID/smaps_rollup). Those 3 pages come from memguard's left, right and rand.
  • During each gather, the PostgreSQL plugin adds up to 2 new locked pages:
    • This is due to SanitizedAddress, which is called by Gather().
    • SanitizedAddress reads the secret (p.Address.Get()).
    • Reading the secret opens the enclave, which internally requires at least 2 pages: one for the key (destroyed before returning) and one for the secret, which is returned.
    • The Telegraf secret then protects the secret (needing a locked page) while the enclave buffer has not yet been destroyed.
  • In the end, with the 64 KB default limit, at most 16 pages of locked memory are available (16 pages of 4 KB):
    • 3 pages are permanently used by the memguard Coffer.
    • 2 pages are used temporarily PER concurrent Gather(). So as soon as 7 Gather() calls run concurrently, 3 + 2 * 7 == 17 pages are needed and one page allocation fails.
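
The page budget above can be checked with a little arithmetic. This is only a sketch: the function name and its arguments are made up for illustration, and the 4 KB page size is an assumption (it is what makes 64 KB equal 16 pages).

```shell
# Sketch: how many concurrent Gather() calls fit under a locked-memory limit.
# Hypothetical helper; the defaults mirror the figures quoted in this issue.
max_concurrent_gathers() {
    limit_kb=${1:-64}     # locked-memory limit in KB
    page_kb=${2:-4}       # page size in KB (assumed to be 4 KB)
    base_pages=${3:-3}    # pages held permanently by memguard
    per_gather=${4:-2}    # extra pages per in-flight Gather()
    total_pages=$((limit_kb / page_kb))
    echo $(((total_pages - base_pages) / per_gather))
}

max_concurrent_gathers 64 4 3 2   # prints 6, so the 7th concurrent Gather() fails
```

With a limit of 8192 KB the same calculation gives 1022 concurrent gathers.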

It might be possible to reduce the locked-memory requirement: could we sanitize the address only once? The sanitized address doesn't seem to contain the password, so it shouldn't need locked memory.
We might also state that Telegraf requires more locked memory and increase the limit in the Telegraf systemd unit (LimitMEMLOCK).
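
To watch the locked page count mentioned above while reproducing the issue, the "Locked:" line of smaps_rollup can be converted into pages. This is a rough sketch: the helper name is made up, the 4 KB page size is an assumption, and reading another user's /proc entry may require root.

```shell
# Sketch: convert the "Locked:" line of a smaps_rollup file into a page count.
# Usage: locked_pages FILE [PAGE_KB]
# FILE is typically /proc/<pid>/smaps_rollup; PAGE_KB defaults to an assumed 4.
locked_pages() {
    awk -v page_kb="${2:-4}" '$1 == "Locked:" { print int($2 / page_kb) }' "$1"
}

# Example (assumes a single running telegraf process):
#   locked_pages "/proc/$(pidof telegraf)/smaps_rollup"
```

Running it repeatedly while several PostgreSQL inputs gather concurrently should show the count moving above the 3 pages held at rest.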

@srebhan
Member

srebhan commented Mar 29, 2023

While the final effect, the hang of the process, is caused by memguard, there might be an underlying issue of growing too many locked memory pages. This is similar to #12924 and #12982. Investigating...

@powersj
Contributor

powersj commented Mar 30, 2023

@PierreF,

Start telegraf as user with the ulimit for memory locked that match what systemd would do (64kb):

Do you know why or how the lock ulimit is set to 64kb? I am not seeing the same on a 22.04 VM. Is this a container, VM, or baremetal system?

Can you get the output from the following commands:

cat /etc/os-release
uname -a
ulimit -a
systemctl show | grep -i LimitMEMLOCK
sudo cloud-init schema --system

We might also say that Telegraf require more locked memory and increase the limit in the Telegraf SystemD unit ( LimitMEMLOCK).

This is the route we probably are going to go for now. Sven will continue working with the memguard library to see if we can use the pages more effectively or if we need to go a different direction as well.

Thanks!

powersj added a commit to powersj/telegraf that referenced this issue Mar 30, 2023
If set too low, the lock memory can run out when using any plugin that
makes use of the secrets data type. Locking and unlocking memory takes a
page each, so multiple plugins or multiple instances of a plugin can
quickly go through pages.

fixes: influxdata#12980
@PierreF
Contributor Author

PierreF commented Mar 30, 2023

I reproduced this issue in an odd setup: a "Docker VM" (docker run --privileged /sbin/init) on Docker Desktop on macOS.

But I took the 64 KB value from an EC2 instance running Ubuntu 22.04, on a service run by systemd.

Here is the result on the "Docker VM":

root@ubuntu2204:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

root@ubuntu2204:~# uname -a
Linux ubuntu2204 5.15.49-linuxkit #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

root@ubuntu2204:~# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 63482
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

root@ubuntu2204:~# systemctl show | grep -i LimitMEMLOCK
DefaultLimitMEMLOCK=65536
DefaultLimitMEMLOCKSoft=65536

root@ubuntu2204:~# sudo cloud-init schema --system
sudo: cloud-init: command not found

Here is the result on the real server where I hit the issue:

pierref@par02-app02:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

pierref@par02-app02:~$ uname -a
Linux par02-app02 5.15.0-1026-aws #30-Ubuntu SMP Wed Nov 23 14:15:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

pierref@par02-app02:~$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 31338
max locked memory           (kbytes, -l) 1004544
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 31338
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

pierref@par02-app02:~$ systemctl show | grep -i LimitMEMLOCK
DefaultLimitMEMLOCK=65536
DefaultLimitMEMLOCKSoft=65536

pierref@par02-app02:~$ sudo cloud-init schema --system
[sudo] password for pierref:
Valid cloud-config: user-data

In my case, the 64 KB limit comes from the value shown by systemctl show | grep -i LimitMEMLOCK (thanks for that command, I hadn't found a way to see that value), since I'm running Telegraf from a systemd unit that doesn't specify LimitMEMLOCK.

But on both servers we don't set DefaultLimitMEMLOCK; it's the default built into the binary:

pierref@par02-app02:~$ ls -lh /etc/systemd/system.conf /etc/systemd/system.conf.d/ /run/systemd/system.conf.d/ /lib/systemd/system.conf.d/
ls: cannot access '/etc/systemd/system.conf.d/': No such file or directory
ls: cannot access '/run/systemd/system.conf.d/': No such file or directory
ls: cannot access '/lib/systemd/system.conf.d/': No such file or directory
-rw-r--r-- 1 root root 2.0K Apr 21  2022 /etc/systemd/system.conf

pierref@par02-app02:~$ grep DefaultLimitMEMLOCK /etc/systemd/system.conf
#DefaultLimitMEMLOCK=
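
The limit that actually applies to an already running process can also be read from its /proc/<pid>/limits file, which avoids guessing whether a shell's ulimit or the systemd default is in effect. A minimal sketch, with a made-up helper name:

```shell
# Sketch: print the soft "Max locked memory" limit from a /proc/<pid>/limits file.
# Usage: memlock_soft_limit FILE
# The value is in bytes, or the word "unlimited".
memlock_soft_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Example (assumes a single running telegraf process):
#   memlock_soft_limit "/proc/$(pidof telegraf)/limits"
```

For a service started with the systemd default shown above, this would be expected to report 65536.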

@powersj
Contributor

powersj commented Mar 30, 2023

Thanks for the fast response!

I put up PR #12990 which sets the lock memory limit to be 8192kb, more than enough.

Here the result on the real server where I hit the issue:

ah ok I was getting my bytes and kilobytes mixed up. My 22.04 VM does in fact show systemd memory lock limit of 64k :( while my desktop running arch shows 8192kb.
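
For installations whose unit file does not carry the higher limit, a systemd drop-in is one way to apply the same value locally. The fragment below is a sketch, not the content of the PR: it assumes the unit is named telegraf.service and uses the conventional drop-in location /etc/systemd/system/telegraf.service.d/override.conf. systemd reads 8M as 8 MiB, i.e. the 8192 KB figure mentioned above.

```ini
# /etc/systemd/system/telegraf.service.d/override.conf (assumed path)
[Service]
LimitMEMLOCK=8M
```

After adding the file, run systemctl daemon-reload and restart the service for the new limit to take effect.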
