inputs.disk missing a disk in docker #10422

Closed
JoshKeegan opened this issue Jan 11, 2022 · 7 comments · Fixed by #11107
Labels: area/system, bug (unexpected problem or unintended behavior)

Comments

@JoshKeegan
Contributor

Relevant telegraf.conf

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

Logs from Telegraf

2022-01-11T09:35:46Z I! Starting Telegraf 1.21.2
2022-01-11T09:35:46Z I! Using config file: /etc/telegraf/telegraf.conf
2022-01-11T09:35:46Z I! Loaded inputs: cpu disk diskio docker http kernel mem ping processes sensors smart swap system teamspeak
2022-01-11T09:35:46Z I! Loaded aggregators:
2022-01-11T09:35:46Z I! Loaded processors:
2022-01-11T09:35:46Z I! Loaded outputs: influxdb
2022-01-11T09:35:46Z I! Tags enabled:
2022-01-11T09:35:46Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s

System info

Telegraf v1.21.2 (latest)

Docker

docker-compose.yaml extract:

telegraf:
    build: ./images/telegraf
    privileged: true
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /proc:/host/proc:ro
      # Mount the HDD so it can access it for stats
      - /home/josh/data:/disks/hdd:ro
      # utmp from host required for system n_users count
      - /var/run/utmp:/var/run/utmp:ro
      - /dev:/dev:ro
      - /var/run/docker.sock:/var/run/docker.sock
    user: telegraf:${DOCKER_GID}
    environment:
      - HOST_PROC=/host/proc
    networks:
      - influxdb
    restart: unless-stopped

Steps to reproduce

  1. Have a system with multiple drives
  2. Run telegraf in docker with inputs.disk set up, mounting in any additional disks (in my case that is /home/josh/data => /disks/hdd)
    ...

Expected behavior

The disk is visible in the container (you can docker exec into it and use df to check this). It should also be picked up by the telegraf disk input and have its stats recorded.

Actual behavior

The disk is visible in the container (you can docker exec into it and use df to check this), but it is not being picked up by the telegraf disk input.
For example, here is a partition usage graph built from the telegraf data; the disk disappears after telegraf is upgraded from 1.20.x to 1.21.x:
[image: partition usage graph]

Additional info

Same config was working in telegraf 1.20.x but the disk goes missing after upgrading to 1.21.x. I originally thought it could be related to #10297 but I see that is fixed in the last release and I still have the issue. I've downgraded to v1.20.4 for now as a workaround.

@JoshKeegan JoshKeegan added the bug unexpected problem or unintended behavior label Jan 11, 2022
@srebhan
Member

srebhan commented Jan 11, 2022

@JoshKeegan are you really running 1.21.2? I thought this was fixed by PR #10318... Can you please start telegraf with --debug --once and send me the output of the disk plugin?

@srebhan srebhan self-assigned this Jan 11, 2022
@JoshKeegan
Contributor Author

JoshKeegan commented Jan 11, 2022

Hi @srebhan,
Yes, it's definitely v1.21.2. If I exec into the container and run telegraf --version, I get this: Telegraf 1.21.2 (git: HEAD 30d981d3).

The inputs.disk output of running with --debug --once is:

2022-01-11T14:10:28Z D! [inputs.disk] [SystemPS] partition 19: {"device":"/dev/sdb1","mountpoint":"/home/josh/data","fstype":"ext4","opts":["rw","relatime"]}
2022-01-11T14:10:28Z D! [inputs.disk] [SystemPS] -> using mountpoint "/home/josh/data"...
2022-01-11T14:10:28Z D! [inputs.disk] [SystemPS] => dropped by disk usage ("/home/josh/data"): no such file or directory

Now that I have read the above logs, it looks like it is reading the mount points from the host, not from the container. So as a workaround I have changed the drive to be mounted at the same path within the container (/home/josh/data) and telegraf now picks it up. Hopefully this can still be fixed in a future release though as it may not always be possible to have the mount points match.
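The workaround described above could look roughly like the following docker-compose.yaml extract. This is a minimal sketch based on the compose file from the original report; only the HDD volume line changes, and every other setting is assumed to stay as it was.

```yaml
telegraf:
    volumes:
      # Workaround sketch: mount the disk at the same path it has on the host,
      # so the mount point read from the host's /proc also exists in the container.
      - /home/josh/data:/home/josh/data:ro
      - /proc:/host/proc:ro
    environment:
      # Still set, so host-level plugins keep reading from the host's /proc.
      - HOST_PROC=/host/proc
```

With this layout the mountpoint "/home/josh/data" from the debug log resolves inside the container, which is why the disk usage lookup no longer fails with "no such file or directory".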

Edit: I trimmed the log down to just the relevant part as it's huge, let me know if you need the full output.

@srebhan
Member

srebhan commented Jan 24, 2022

@JoshKeegan sorry for my late response, I've been quite busy.

I'm a bit puzzled as to why this happens, but I guess it's due to HOST_PROC being set. With it set, the telegraf container will try to get the disks/partitions from the host. Can you try to unset HOST_PROC and see if that fixes the issue for you?

@JoshKeegan
Contributor Author

No worries, thanks for looking into it!

I've tried it, and unsetting HOST_PROC does fix this issue. However, it needs to be set for the processes input plugin to work; otherwise that plugin will only see process information from within the container.

@srebhan
Member

srebhan commented Jan 24, 2022

That's... unfortunate. :-) I'm not sure how to fix this, as we cannot set and unset that environment variable at the same time in one process. Nor can we pass these variables as parameters to gopsutil (the underlying lib we use). To be honest, I don't see a way to monitor the disks inside the container and the processes outside of the container without running two instances. @powersj or @sspaink any other ideas?
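For illustration, the "two instances" option mentioned above might be sketched as two compose services, one with HOST_PROC set and one without. This is only a hypothetical layout: the service names telegraf-host and telegraf-disk and the two config file names are assumptions, not something from the original setup, and each config would need to enable only the plugins that belong to that instance.

```yaml
# Hypothetical split; service names and config file names are assumptions.
telegraf-host:
    build: ./images/telegraf
    volumes:
      # Config that enables the host-level plugins (e.g. processes)
      - ./telegraf/telegraf-host.conf:/etc/telegraf/telegraf.conf:ro
      - /proc:/host/proc:ro
    environment:
      - HOST_PROC=/host/proc

telegraf-disk:
    build: ./images/telegraf
    volumes:
      # Config that enables only inputs.disk
      - ./telegraf/telegraf-disk.conf:/etc/telegraf/telegraf.conf:ro
      - /home/josh/data:/disks/hdd:ro
    # HOST_PROC is deliberately left unset here, so partitions are read
    # from the container's own /proc.
```

The trade-off is an extra container and a second config file to maintain.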

@srebhan
Member

srebhan commented May 17, 2022

Hey @JoshKeegan,

can you try the artifacts built in #11107 with the following docker-compose.yaml?

telegraf:
    build: ./images/telegraf
    privileged: true
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /proc:/host/proc:ro
      # Mount the HDD so it can access it for stats
      - /home/josh/data:/disks/hdd:ro
      # utmp from host required for system n_users count
      - /var/run/utmp:/var/run/utmp:ro
      - /dev:/dev:ro
      - /var/run/docker.sock:/var/run/docker.sock
    user: telegraf:${DOCKER_GID}
    environment:
      - HOST_PROC=/host/proc
      - HOST_PROC_MOUNTINFO=/proc/1
    networks:
      - influxdb
    restart: unless-stopped

Please note the HOST_PROC_MOUNTINFO variable, which overrides HOST_PROC when getting the partition information, so that it is read from within the container.

@JoshKeegan
Contributor Author

Thanks @srebhan, that's fixed it 👍

2 participants