
Vector keeps old Inodes, clogging up disk #11742

Closed
MaxRink opened this issue Mar 9, 2022 · 21 comments
Labels: `source: file` (Anything `file` source related), `type: bug` (A code related bug)

Comments

@MaxRink commented Mar 9, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We ran into an issue where Vector keeps old inodes open, filling up the disk:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G   49G     0 100% /
```

Inodes: https://gist.github.com/MaxRink/ee056e27a4b11a7b710e437e1f892984

After pkilling Vector, disk usage returned to normal:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G  8.5G   38G  19% /
```

Configuration

```text
api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /var/lib/vector/
sinks:
  prometheus_metrics:
    address: 0.0.0.0:9090
    inputs:
    - internal_metrics
    type: prometheus_exporter
  vector_sink:
    address: vector-aggregator:9000
    inputs:
    - kubernetes_logs
    - annotate_hostname
    type: vector
    version: "2"
sources:
  internal_metrics:
    type: internal_metrics
  json_logfiles_var_log:
    include:
    - /var/log/kube-apiserver/kube-apiserver.log
    max_line_bytes: 1536000
    type: file
  kubernetes_logs:
    type: kubernetes_logs
  logfiles_var_log:
    exclude:
    - /var/log/pods/**/*.log
    - /var/log/containers/**/*.log
    - /var/log/kube-apiserver/kube-apiserver.log
    - /var/log/kube-apiserver/kube-apiserver*.log
    - /var/log/**/*.gz
    - /var/log/journal/**/*
    include:
    - /var/log/**/*.log
    - /var/log/private
    - /var/log/lastlog
    - /var/log/syslog
    - /var/log/btmp
    - /var/log/faillog
    - /var/log/wtmp
    - /var/log/dmesg
    max_line_bytes: 409600
    type: file
transforms:
  annotate_hostname:
    inputs:
    - logfiles*
    source: .host = "${VECTOR_SELF_NODE_NAME}"
    type: remap
  logfiles_json_decode:
    inputs:
    - json_logfiles_var_log
    source: .msg_decoded = parse_json!(.message)
    type: remap
```

Version

0.20.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@MaxRink added the `type: bug` label on Mar 9, 2022
@jszwedko (Member) commented Mar 9, 2022

Hi @MaxRink !

Vector will hold onto deleted files until it has fully read them. Can you tell if that's what is happening here?

@jszwedko added the `source: file` label on Mar 9, 2022
@MaxRink (Author) commented Mar 9, 2022

They got rotated (and, after a few iterations, deleted) by logrotate or by the k8s internal apiserver logrotate.

@jszwedko (Member) commented Mar 9, 2022

@MaxRink makes sense, but do you know if Vector is still reading from them? Attaching strace should tell you.
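
For reference, a minimal way to check from the node whether Vector still holds handles on deleted files; this assumes `lsof` is available and that the process command name is `vector`, and the `pgrep`-based PID lookup is an assumption about how Vector was started:

```sh
# List open files whose on-disk link count is 0, i.e. deleted but still held open:
lsof -nP +L1 -c vector

# Or inspect the fd table directly; "(deleted)" marks files that were rotated away:
ls -l /proc/"$(pgrep -o -x vector)"/fd | grep '(deleted)'
```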

@MaxRink (Author) commented Mar 9, 2022

I don't think it is.

I'm not seeing any output from strace -p 1326978 -e trace=file (where 1326978 is the Vector PID).

Edit: I forgot to trace recursively, as all the I/O seems to happen in child processes, but I'm not seeing access to the files above in there either:
https://gist.github.com/MaxRink/05dd37f6d3daf7b996821eb46b85ee29
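
A small sketch of the recursive invocation mentioned in the edit above, reusing the PID from the comment; note that `trace=file` only covers syscalls that take a path, so reads on already-open (possibly deleted) files need a separate read trace:

```sh
# -f follows child threads/processes, where the file I/O appears to happen:
strace -f -p 1326978 -e trace=file

# trace=file only shows path-taking syscalls (open, stat, ...); reads on
# already-open (possibly deleted) files only show up when tracing read():
strace -f -p 1326978 -e trace=read
```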

@shomilj commented Apr 22, 2023

We're also seeing this issue -- maybe it's because we're not reaching EOF for some reason? Not sure why that would be happening, though.

@punkerpunker (Contributor)

Also faced this issue last night; Vector keeps file descriptors open for rotated logs. Is there any known workaround for that?

@jszwedko (Member)

#17603 may fix this.

@jcantrill commented Aug 10, 2023

👍 We have this issue reported by several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter that allows them to not necessarily read rotated files in their entirety, which may be a useful feature instead of always trying to read to EOF.

@jszwedko (Member)

> 👍 We have this issue reported by several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter that allows them to not necessarily read rotated files in their entirety, which may be a useful feature instead of always trying to read to EOF.

The current implementation of the file source is expected to tail files until they are deleted. Once they are deleted and EOF has been read, the handle is released. If users are hitting a condition under which Vector has read to EOF of a deleted file and still hasn't released the file handle, that would be a bug. We haven't been able to reproduce that behavior yet, though, so we would love a reproduction case 🙂.

Having a similar parameter to rotate_wait from Fluentd would cause less pressure but would risk missing more logs if Vector hadn't finished reading the deleted file.
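
As a generic illustration of the mechanism described above (not specific to Vector): a deleted file keeps consuming disk space until the last process holding it open closes its handle. A minimal sketch, assuming /tmp is on the filesystem being watched:

```sh
dd if=/dev/zero of=/tmp/rotated.log bs=1M count=512   # stand-in for a large log file
tail -f /tmp/rotated.log > /dev/null &                # a tailer holds the file open
TAIL_PID=$!
rm /tmp/rotated.log                                   # "logrotate" deletes it
df -h /tmp                                            # the space is still in use
ls -l /proc/"$TAIL_PID"/fd | grep '(deleted)'         # the open-but-deleted handle
kill "$TAIL_PID"; sleep 1; df -h /tmp                 # space is reclaimed on close
```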

@jszwedko (Member)

#17603 (comment) steps through how this logic of reading deleted files until EOF works in Vector.

@jcantrill

@jszwedko we are using the kubernetes source. Does that matter?

@jszwedko (Member)

> @jszwedko we are using the kubernetes source. Does that matter?

It shouldn't. The behavior is the same since they share the same underlying file tailing implementation.

@akutta (Contributor) commented Aug 23, 2023

#18088 (comment)

@jcantrill

> Having a similar parameter to rotate_wait from Fluentd would cause less pressure but would risk missing more logs if Vector hadn't finished reading the deleted file.

@jszwedko I would like to revisit this, as I realize the ramification of adding such a parameter is log loss; we see it now with fluentd. It's naive to think, however, that the collector and host resources are infinite and that the collector is always able to keep up with the log volume. We routinely push the collector on OpenShift clusters to the point where Vector is unable to keep up with the load. As an admin, is it better for me to "never lose a log" or to continue to collect? A configuration point would be an "opt-in" choice where admins know the trade-offs. I suspect it could be instrumented as well, with a metric to identify when Vector didn't reach EOF, or something similar to write an alert against.

@eddy1o2 commented Sep 8, 2023

Hi team, I also faced this issue. Has anyone solved it yet?
Basically,

  • the actual node disk space used by other processes is pretty small (when we run du on the node) - ours is around 10 GB
  • when we run lsof -nP +L1 to list open file descriptors for deleted files, we find ~90 GB held by vector - this is in line with what df shows - so those log files are supposed to be deleted (due to log rotation) but they aren't, and they still occupy disk space because the vector process is holding onto them (a quick way to total this up is sketched after this comment)

My vector specs:

  • Version: 0.28.0-debian
  • Configuration
    sources:
      kubernetes_logs:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 500
        max_read_bytes: 2048000
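
A rough way to total the space pinned by deleted-but-open files for the vector process, as referenced in the comment above; this assumes the usual lsof column layout with +L1 (where SIZE/OFF is the 7th column) and that the command name is `vector`:

```sh
# Sum the sizes of vector's open files that have a link count of 0 and print
# the total in GiB, for comparison with the gap between df and du:
lsof -nP +L1 -c vector | awk 'NR > 1 { sum += $7 } END { printf "%.1f GiB\n", sum / 1024^3 }'
```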

@jszwedko (Member) commented Sep 8, 2023

#11742 (comment) is the expectation. In your case, it means that Vector is still reading from those deleted files. If you believe that not to be the case, we would definitely appreciate a reproduction case for this as we've tried a few times but haven't been able to reproduce behavior where Vector doesn't release the file handle once it gets to EOF for a deleted file.

@eddy1o2 commented Sep 11, 2023

I see, but in #11742 (comment) you mentioned the file source, while we are using the kubernetes_logs source. Is it still expected?

@eddy1o2 commented Sep 11, 2023

Or is it related to the fix of #18088 for the kubernetes_logs source?

@jszwedko (Member)

> I see, but in #11742 (comment) you mentioned the file source, while we are using the kubernetes_logs source. Is it still expected?

Ah, yes, they use the same underlying mechanisms for file reading so the behavior, in this respect, should be the same. Thanks for clarifying!

@neuronull (Contributor)

After discussing this one with the team, we came to the following conclusion:

  • We will keep this bug report as-is, but close it, because the behavior of Vector in this case isn't really a bug (Vector is working as designed); the symptom just presents itself this way when there is an underlying performance concern somewhere in the pipeline, and once that is addressed this issue goes away.
  • A separate issue (Support load shedding for the file source #18863) tracks the feature request for an opt-in config option, similar to Fluentd's rotate_wait, that would mean Vector doesn't read to EOF.
  • A separate issue (Add internal telemetry for identifying when the file source is reading to EOF for deleted files #18864) tracks adding visibility into Vector's internal telemetry so that this situation is more easily identifiable to users.

@benjaminhuo

cc @wanjunlei
