
Vector keeps old Inodes, clogging up disk #11742

Closed
MaxRink opened this issue Mar 9, 2022 · 21 comments
Labels: `source: file` (Anything `file` source related), `type: bug` (A code related bug)

Comments

@MaxRink commented Mar 9, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We ran into an issue where Vector keeps old inodes open, filling up the disk:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G   49G     0 100% /
```

Inodes: https://gist.github.com/MaxRink/ee056e27a4b11a7b710e437e1f892984

After pkilling Vector, disk usage returned to normal:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G  8.5G   38G  19% /
```

Configuration

```text
api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /var/lib/vector/
sinks:
  prometheus_metrics:
    address: 0.0.0.0:9090
    inputs:
    - internal_metrics
    type: prometheus_exporter
  vector_sink:
    address: vector-aggregator:9000
    inputs:
    - kubernetes_logs
    - annotate_hostname
    type: vector
    version: "2"
sources:
  internal_metrics:
    type: internal_metrics
  json_logfiles_var_log:
    include:
    - /var/log/kube-apiserver/kube-apiserver.log
    max_line_bytes: 1536000
    type: file
  kubernetes_logs:
    type: kubernetes_logs
  logfiles_var_log:
    exclude:
    - /var/log/pods/**/*.log
    - /var/log/containers/**/*.log
    - /var/log/kube-apiserver/kube-apiserver.log
    - /var/log/kube-apiserver/kube-apiserver*.log
    - /var/log/**/*.gz
    - /var/log/journal/**/*
    include:
    - /var/log/**/*.log
    - /var/log/private
    - /var/log/lastlog
    - /var/log/syslog
    - /var/log/btmp
    - /var/log/faillog
    - /var/log/wtmp
    - /var/log/dmesg
    max_line_bytes: 409600
    type: file
transforms:
  annotate_hostname:
    inputs:
    - logfiles*
    source: .host = "${VECTOR_SELF_NODE_NAME}"
    type: remap
  logfiles_json_decode:
    inputs:
    - json_logfiles_var_log
    source: .msg_decoded = parse_json!(.message)
    type: remap
```

Version

0.20.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@MaxRink added the `type: bug` label on Mar 9, 2022
@jszwedko (Member) commented Mar 9, 2022

Hi @MaxRink !

Vector will hold onto deleted files until it has fully read them. Can you tell if that's what is happening here?

@jszwedko added the `source: file` label on Mar 9, 2022
@MaxRink (Author) commented Mar 9, 2022

They got rotated (and, after a few iterations, deleted) by logrotate or by the k8s internal apiserver logrotate.

@jszwedko (Member) commented Mar 9, 2022

@MaxRink makes sense, but do you know if Vector is still reading from them? Attaching strace should tell you.
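
For reference, a minimal way to check from the node whether Vector still holds handles on deleted files; this assumes `lsof` is available and that the process command name is `vector`, and the `pgrep`-based PID lookup is an assumption about how Vector was started:

```sh
# List open files whose on-disk link count is 0, i.e. deleted but still held open:
lsof -nP +L1 -c vector

# Or inspect the fd table directly; "(deleted)" marks files that were rotated away:
ls -l /proc/"$(pgrep -o -x vector)"/fd | grep '(deleted)'
```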

@MaxRink (Author) commented Mar 9, 2022

I don't think it is.

I'm not seeing any output from strace -p 1326978 -e trace=file (where 1326978 is the Vector PID).

Edit: I forgot to trace recursively, as all the I/O seems to happen in child processes, but I'm not seeing access to the files above in there either:
https://gist.github.com/MaxRink/05dd37f6d3daf7b996821eb46b85ee29
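
A small sketch of the recursive invocation mentioned in the edit above, reusing the PID from the comment; note that `trace=file` only covers syscalls that take a path, so reads on already-open (possibly deleted) files need a separate read trace:

```sh
# -f follows child threads/processes, where the file I/O appears to happen:
strace -f -p 1326978 -e trace=file

# trace=file only shows path-taking syscalls (open, stat, ...); reads on
# already-open (possibly deleted) files only show up when tracing read():
strace -f -p 1326978 -e trace=read
```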

@shomilj commented Apr 22, 2023

We're also seeing this issue -- maybe it's because we're not reaching EOF for some reason? Not sure why that would be happening, though.

@punkerpunker (Contributor)

Also faced this issue last night; Vector keeps file descriptors open for rotated logs. Is there any known workaround for that?

@jszwedko (Member)

#17603 may fix this.

@jcantrill commented Aug 10, 2023

👍 We have this issue reported by several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter that allows them to not necessarily read rotated files in their entirety, which may be a useful feature instead of always trying to read to EOF.

@jszwedko (Member)

> 👍 We have this issue reported by several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter that allows them to not necessarily read rotated files in their entirety, which may be a useful feature instead of always trying to read to EOF.

The current implementation of the file source is expected to tail files until they are deleted. Once they are deleted and EOF has been read, the handle is released. If users are hitting a condition under which Vector has read to EOF of a deleted file and still hasn't released the file handle, that would be a bug. We haven't been able to reproduce that behavior yet, though, so we would love a reproduction case 🙂.

Having a similar parameter to rotate_wait from Fluentd would cause less pressure but would risk missing more logs if Vector hadn't finished reading the deleted file.
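
As a generic illustration of the mechanism described above (not specific to Vector): a deleted file keeps consuming disk space until the last process holding it open closes its handle. A minimal sketch, assuming /tmp is on the filesystem being watched:

```sh
dd if=/dev/zero of=/tmp/rotated.log bs=1M count=512   # stand-in for a large log file
tail -f /tmp/rotated.log > /dev/null &                # a tailer holds the file open
TAIL_PID=$!
rm /tmp/rotated.log                                   # "logrotate" deletes it
df -h /tmp                                            # the space is still in use
ls -l /proc/"$TAIL_PID"/fd | grep '(deleted)'         # the open-but-deleted handle
kill "$TAIL_PID"; sleep 1; df -h /tmp                 # space is reclaimed on close
```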

@jszwedko (Member)

#17603 (comment) steps through how this logic of reading deleted files until EOF works in Vector.

@jcantrill

@jszwedko we are using the kubernetes source. Does that matter?

@jszwedko (Member)

> @jszwedko we are using the kubernetes source. Does that matter?

It shouldn't. The behavior is the same since they share the same underlying file tailing implementation.

@akutta (Contributor) commented Aug 23, 2023

#18088 (comment)

@jcantrill

> Having a similar parameter to rotate_wait from Fluentd would cause less pressure but would risk missing more logs if Vector hadn't finished reading the deleted file.

@jszwedko I would like to revisit this, as I realize the ramification of adding such a parameter is log loss; we see it now with fluentd. It's naive to think, however, that the collector and host resources are infinite and that the collector is always able to keep up with the log volume. We routinely push the collector on OpenShift clusters to the point where Vector is unable to keep up with the load. As an admin, is it better for me to "never lose a log" or to continue to collect? A configuration point would be an "opt-in" choice where admins know the trade-offs. I suspect it could be instrumented as well, with a metric to identify when Vector didn't reach EOF, or something similar to write an alert against.

@eddy1o2 commented Sep 8, 2023

Hi team, I also faced this issue. Has anyone solved it yet?
Basically,

  • the actual node disk space used by other processes is pretty small (when we run du on the node) - ours is around 10 GB
  • when we run lsof -nP +L1 to list open file descriptors for deleted files, we find ~90 GB held by vector - this is in line with what df shows - so those log files are supposed to be deleted (due to log rotation) but they aren't, and they still occupy disk space because the vector process is holding onto them (a quick way to total this up is sketched after this comment)

My vector specs:

  • Version: 0.28.0-debian
  • Configuration
    sources:
      kubernetes_logs:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 500
        max_read_bytes: 2048000
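
A rough way to total the space pinned by deleted-but-open files for the vector process, as referenced in the comment above; this assumes the usual lsof column layout with +L1 (where SIZE/OFF is the 7th column) and that the command name is `vector`:

```sh
# Sum the sizes of vector's open files that have a link count of 0 and print
# the total in GiB, for comparison with the gap between df and du:
lsof -nP +L1 -c vector | awk 'NR > 1 { sum += $7 } END { printf "%.1f GiB\n", sum / 1024^3 }'
```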

@jszwedko (Member) commented Sep 8, 2023

#11742 (comment) is the expectation. In your case, it means that Vector is still reading from those deleted files. If you believe that not to be the case, we would definitely appreciate a reproduction case for this as we've tried a few times but haven't been able to reproduce behavior where Vector doesn't release the file handle once it gets to EOF for a deleted file.

@eddy1o2 commented Sep 11, 2023

I see, but in #11742 (comment) you mentioned the file source, while we are using the kubernetes_logs source. Is it still expected?

@eddy1o2 commented Sep 11, 2023

Or is it related to the fix of #18088 for the kubernetes_logs source?

@jszwedko (Member)

> I see, but in #11742 (comment) you mentioned the file source, while we are using the kubernetes_logs source. Is it still expected?

Ah, yes, they use the same underlying mechanisms for file reading so the behavior, in this respect, should be the same. Thanks for clarifying!

@neuronull (Contributor)

After discussing this one with the team, we came to the following conclusion:

  • We will keep this bug report as-is, but close it, because the behavior of Vector in this case isn't really a bug (Vector is working as designed); the symptom just presents itself this way when there is an underlying performance concern somewhere in the pipeline, and once that is addressed this issue goes away.
  • A separate issue (Support load shedding for the file source #18863) tracks the feature request for an opt-in config option, similar to Fluentd's rotate_wait, that would mean Vector doesn't read to EOF.
  • A separate issue (Add internal telemetry for identifying when the file source is reading to EOF for deleted files #18864) tracks adding visibility into Vector's internal telemetry so that this situation is more easily identifiable to users.

@benjaminhuo

cc @wanjunlei
