Identity tracking of files in Filebeat inputs #13492

Closed
1 task
kvch opened this issue Sep 4, 2019 · 22 comments

@kvch
Contributor

kvch commented Sep 4, 2019

The current file identification of Filebeat is limited and does not support network shares well.
Right now inode and device id are used to tell files apart. But from time to time device id changes on such shares, so Filebeat rereads already processed files.

As there are many options to track file identity and there is no silver bullet to fit all use cases, this should be configurable.

Possible choices:

  • inode+device ID
  • path
  • fingerprint
  • inode+fingerprint
  • inode+UUID

All choices have their advantages and disadvantages:

File identity     | Device ID changes | Log rotation | Computation complexity
inode+device ID   | -                 | ✔️           | O(1)
path              | ✔️                | -            | O(1)
fingerprint       | ✔️                | ✔️           | O(n) (n=len(data))
inode+fingerprint | ✔️                | ✔️           | O(1)+O(n) (n=len(data))

Suggested configuration format (by @urso):

filebeat.inputs:
- type: log
  file_identity.fingerprint:
    <more settings>

- type: log
  file_identity.system:  ~ # e.g. use inode+device on Unix

- type: log
  file_identity.path:  ~

- type: log
  file_identity.fingerprint_fileid:
    <more settings>

As there might be different requirements in special use cases, we intend to provide a pluggable interface so users can write their own identity tracker.

I propose the following interface:

import "os"

// FileIdentifier decides whether two open files are the same file.
type FileIdentifier interface {
    SameFile(one *os.File, other *os.File) (bool, error)
}

where SameFile is able to support both the existing os.SameFile check and/or fingerprinting the contents of the files.

Further issues we need to address:

  • How to switch file identifier strategies between restarts?

@urso @ph @faec WDYT?

@kvch kvch added discuss Issue needs further discussion. Filebeat Filebeat labels Sep 4, 2019
@kvch kvch self-assigned this Sep 4, 2019
@johnhoughton-v

We have an issue, but it doesn't involve network or shared volume. The volume is local and does not change. But the instance does.

USE CASE: We use terraform to script our AWS infrastructure. The log files and filebeat repo are stored on a non-root volume, e.g. /data, that is mounted on a device that accesses an AWS volume different from the root volume. When the instance is rebuilt, the non-root AWS volume is preserved and remounted to the same dir on the new instance. So, inodes are stable, but the device ID changes when the instance is rebuilt.

In my use case, inode+device ID is problematic, and simply using inode would be sufficient.

@urso

urso commented Sep 4, 2019

In my use case, inode+device ID is problematic, and simply using inode would be sufficient.

Simply inode can lead to false positives, in case users have multiple mount points to different devices. The inode+fingerprint would use the fingerprinting to handle possible conflicts. Given that the interface is pluggable, users can provide other implementations if required.
The difficulty with network-based filesystems is that one needs to understand how the OSes, protocols, software versions, and filesystems in use actually behave. In the worst case, the device ID or inode may even be randomized.

Regarding fingerprinting complexity: the question is what 'n' is. It has not been discussed yet how the fingerprint will be computed. n should not be the complete file, but the computation still has to take the file size into account if the file is smaller than n.

I propose the following interface: ...

I wonder where/how this would be used. We also need to compute the (file.State).ID, which is used to look up file states in the registry file.

@johnhoughton-v

Simply inode can lead to false positives, in case users have multiple mount points to different devices.

Your point is valid, @urso . My use case is a specific case where I think the device id is not stable, but I think it should be. Simple solution, ignore the device id since there is only one device.

I think that the device id(s) should be the same if the same volume(s) are attached.

Perhaps having the option on linux to use a UUID or PARTUUID as the device id, or at least to generate the device id from it? (see bash command blkid)

@urso

urso commented Sep 4, 2019

Perhaps having the option on linux to use a UUID or PARTUUID as the device id, or at least to generate the device id from it? (see bash command blkid)

Good idea. We should look into this. Can you check if the UUID stays stable for your use case?

@johnhoughton-v

Yes, in my case, the UUID is stable. To illustrate: I terminated the AWS instance and restarted it.

BEFORE - instance "A":

[...@ip-xxx-xxx-xxx-203~]$ lsblk -o NAME,MOUNTPOINT,UUID /dev/nvme1n1
NAME    MOUNTPOINT UUID
nvme1n1 /data      f508bd67-6c20-4d84-9caf-e9edd18a150b

AFTER - instance "B":

[...@ip-xxx-xxx-xxx-132 ~]$ lsblk -o NAME,MOUNTPOINT,UUID /dev/nvme1n1
NAME    MOUNTPOINT UUID
nvme1n1            f508bd67-6c20-4d84-9caf-e9edd18a150b

Note: using terraform, I can guarantee that the same AWS volume will be used when the AWS instance is terminated. Since I don't run fdisk on the volume on subsequent startups, the UUID remains the same.

@urso

urso commented Sep 5, 2019

Thank you for testing. This looks very promising.

I think we should also investigate if we can use UUID by default and have an automated upgrade path for users coming from an older registry file. I'm not really sure about the UUID in the presence of NFS or CIFS, especially if the server is Windows or a very old system.

Anyway, inode+UUID should definitely be an option.

@kvch
Contributor Author

kvch commented Sep 5, 2019

I added it to the list of possible options. But it still needs more investigation.

@johnhoughton-v Thank you for the suggestion!

@breml
Contributor

breml commented Sep 6, 2019

I would also like to chime in on this issue.
As reference, we are affected by #13314 and I have written the PR #13393 to ignore the device ID as well as an extension to it which allows auto-migration from an old registry state (see breml@6737c58).
Additionally we mentioned our problem on discuss.

Until now we do not know exactly why the device ID changes in our case, but the description @johnhoughton-v provided in the comment above sounds plausible for our case as well (even though we are not on AWS but on the infrastructure of a local cloud provider).

I will try to find out, if the UUID generated by blkid stays the same in our infrastructure as well (but this can take some time, because the problem only hits us after maintenance windows of our cloud provider, which do not happen very often).

I have a suggestion for another approach: an alternative to the device ID. What about a special marker file (on *nix this could be a hidden file), which sits next to the files that are indexed by filebeat (or maybe filebeat could traverse the path up to the root to find such a file)? Its content would be a unique id that is used instead of the device ID. For the situations @johnhoughton-v and we face, this would solve the problem as well, because in our cases the block device with all its contents stays the same, but the system (maybe due to a rebuild of the instance) generates a new device ID for the same block device. Therefore, this marker file would be sufficient to identify the block device / file system as the same.

@breml
Contributor

breml commented Sep 6, 2019

For reference, related issues:

As well as discussions on discuss:

@breml
Contributor

breml commented Sep 12, 2019

@kvch @urso I would like to know what the next steps are on this ticket and how I can help you with it. This issue and the resolution of our problem are important for us, so we are willing to help, e.g. with the implementation. Please let me know.

@breml
Contributor

breml commented Oct 14, 2019

I kindly ask again, how I can help to get some progress on this issue.

@andresrc andresrc added Team:Integrations Label for the Integrations team and removed Team:Beats labels Mar 6, 2020
@johnhoughton-v

We just had a significant production incident caused by this issue, so I decided to check back in to see if there has been any progress. Can anyone provide an update on where this stands?

@breml
Contributor

breml commented May 22, 2020

@johnhoughton-v For me this topic is also still relevant, but there has been no progress, no reaction, nothing for quite some time, even though I provided a PR and offered to help. So I do not really have any hope on this topic.

@urso

urso commented May 25, 2020

Creating a marker file sounds interesting, but would be a little difficult to coordinate between multiple inputs with the current architecture (I think we can improve on this in the future). For now it might be easier if we ask for the marker file to exist already before we start collecting. All in all I'd prefer if Beats would not need write access to files or directories.

@johnhoughton-v

johnhoughton-v commented May 25, 2020 via email

@urso

urso commented May 26, 2020

Why not have the marker file be something that the user provides? Then, no write access is necessary.

If, in the config, we provided the name of the file, Filebeat could read the contents of that file.

We could populate that file with some unique value, like the UUID of the device. The marker file could be read once on startup of Filebeat.

Yeah, this is what I had in mind by saying "For now it might be easier if we ask for the marker file to exist already before we start collecting".

A random UUID might be good enough. We will definitely need to document examples on how to create the 'device' file.

If the file is missing we would not collect from the directory until it is present. The thing I wonder is: do we want the file to be present in each directory, or just somewhere in the parent directories?

@johnhoughton-v

I think that a marker/sentinel file would be fine from our/the user end. Here are a few thoughts:

  1. If the file doesn't exist or can't be read, then don't use it: just log that state and use just the inode. (That's basically the use case that we're asking for anyway.)

  2. If it exists and is readable, read it and hash the contents to get a "device indicator". From the device indicator, you can derive the internal device_id that filebeat already uses.

  3. As a user, I would produce the file for my "/data" volume using something like this (it would contain the UUID of the device that holds /data, and then store it in the base dir of the files I'm prospecting):

blkid -o value $(df --output=source /data | tail -1) | head -1 > /data/output/filebeat_marker.dat

  4. When I configure a filebeat input, I would specify an optional marker file for that input:

- type: log
  paths:
    - /data/output/*/*log
  marker_file:
    - /data/output/filebeat_marker.dat

  5. To answer your previous question, I think that a sentinel/marker file should apply to all files identified in a given input stanza, as suggested above.

@kvch
Contributor Author

kvch commented May 26, 2020

I have started to work on this feature. The first PR is still in progress, but it can be tracked here: #18748

@andresrc
Contributor

can we close this issue?

@kvch
Contributor Author

kvch commented Jul 16, 2020

Follow up in #19990

@breml
Contributor

breml commented Jul 20, 2020

@kvch Thank you for your effort. This new feature is highly appreciated and I am looking forward to testing it in our environment.

@kvch
Contributor Author

kvch commented Jul 20, 2020

@breml Thank you for your kind words! I am looking forward to your feedback.
