Identity tracking of files in Filebeat inputs #13492

kvch · 2019-09-04T12:51:57Z

Filebeat's current file identification is limited and does not support network shares well.
Right now, inode and device ID are used to tell files apart. But the device ID occasionally changes on such shares, so Filebeat rereads files it has already processed.

As there are many options to track file identity and there is no silver bullet to fit all use cases, this should be configurable.

Possible choices:

inode+device ID
path
fingerprint
inode+fingerprint
inode+UUID

All choices have their advantages and disadvantages:

File identity	Device ID changes	Log rotation	Computation complexity
inode+device ID	❌	✔️	O(1)
path	✔️	❌	O(1)
fingerprint	✔️	✔️	O(n) (n=`len(data)`)
inode+fingerprint	✔️	✔️	O(1)+O(n) (n=`len(data)`)

Suggested configuration format (by @urso):

filebeat.inputs:
- type: log
  file_identity.fingerprint:
    <more settings>

- type: log
  file_identity.system: ~  # e.g. use inode+device on Unix

- type: log
  file_identity.path:  ~

- type: log
  file_identity.fingerprint_fileid:
    <more settings>

As there might be different requirements in special use cases, we intend to provide a pluggable interface so users can write their own identity tracker.

I propose the following interface:

type FileIdentifier interface {
    // SameFile reports whether one and other refer to the same file.
    SameFile(one *os.File, other *os.File) (bool, error)
}

where SameFile is able to support both the existing os.SameFile check and/or fingerprinting the contents of the files.

Further issues we need to address:

How to switch file identifier strategies between restarts?

@urso @ph @faec WDYT?


johnhoughton-v · 2019-09-04T13:07:58Z

We have an issue, but it doesn't involve network or shared volume. The volume is local and does not change. But the instance does.

USE CASE: We use terraform to script our AWS infrastructure. The log files and filebeat repo are stored on a non-root volume, e.g. /data, that is mounted on a device that accesses an AWS volume different from the root volume. When the instance is rebuilt, the non-root AWS volume is preserved and remounted to the same dir on the new instance. So, inodes are stable, but the device ID changes when the instance is rebuilt.

In my use case, inode+device ID is problematic, and simply using inode would be sufficient.

urso · 2019-09-04T14:45:16Z

In my use case, inode+device ID is problematic, and simply using inode would be sufficient.

Simply inode can lead to false positives, in case users have multiple mount points to different devices. The inode+fingerprint would use the fingerprinting to handle possible conflicts. Given that the interface is pluggable, users can provide other implementations if required.
The difficulty with network-based filesystems is that one needs to understand how the OSes, protocols, software versions, and filesystems in use actually behave. At worst, there is always a chance that the device ID or inode is randomized.

Regarding fingerprinting complexity: the question is what 'n' is. How the fingerprint will be computed has not been discussed yet. n should not be the complete file, but the computation should still take the file size into account if the file is smaller than n.

I propose the following interface: ...

I wonder where/how this would be used. We also need to compute the (file.State).ID, that is used to lookup file states in the registry file.
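To illustrate the concern, here is a hedged sketch of an alternative shape in which each strategy derives a string key that can be used directly for registry lookups, rather than comparing files pairwise. The names keyer and pathKeyer, the "path::" prefix, and the map standing in for the registry are all hypothetical; this is not Filebeat's actual registry code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// keyer is a hypothetical alternative to a pairwise SameFile check:
// each identity strategy derives a string key for registry lookups.
type keyer interface {
	Key(path string) (string, error)
}

// pathKeyer implements the "path" strategy: the key is the cleaned
// absolute path, prefixed with the strategy name to avoid clashes
// between keys produced by different strategies.
type pathKeyer struct{}

func (pathKeyer) Key(path string) (string, error) {
	abs, err := filepath.Abs(path)
	if err != nil {
		return "", err
	}
	return "path::" + abs, nil
}

func main() {
	// A plain map stands in for the registry: key -> read offset.
	registry := map[string]int64{}

	var k keyer = pathKeyer{}
	key, err := k.Key("/var/log/app.log")
	if err != nil {
		panic(err)
	}
	registry[key] = 1024

	// An equivalent spelling of the same path maps to the same state.
	again, err := k.Key("/var/log/../log/app.log")
	if err != nil {
		panic(err)
	}
	fmt.Println(registry[again])
}
```

A key-based design makes the lookup a single map access, whereas SameFile alone would require comparing a file against every stored state.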

johnhoughton-v · 2019-09-04T15:18:10Z

Simply inode can lead to false positives, in case users have multiple mount points to different devices.

Your point is valid, @urso. My use case is a specific case where I think the device id is not stable, but I think it should be. A simple solution: ignore the device id, since there is only one device.

I think that the device id(s) should be the same if the same volume(s) are attached.

Perhaps having the option on Linux to use a UUID or PARTUUID as the device id, or at least to use it to generate the device id? (see bash command blkid)

urso · 2019-09-04T15:27:40Z

Perhaps having the option on Linux to use a UUID or PARTUUID as the device id, or at least to use it to generate the device id? (see bash command blkid)

Good idea. We should look into this. Can you check if the UUID stays stable for your use case?

johnhoughton-v · 2019-09-04T16:38:10Z

Yes, in my case, the UUID is stable. To illustrate: I terminated the AWS instance and restarted it.

BEFORE - instance "A":

[...@ip-xxx-xxx-xxx-203~]$ lsblk -o NAME,MOUNTPOINT,UUID /dev/nvme1n1
NAME    MOUNTPOINT UUID
nvme1n1 /data      f508bd67-6c20-4d84-9caf-e9edd18a150b

AFTER - instance "B":

[...@ip-xxx-xxx-xxx-132 ~]$ lsblk -o NAME,MOUNTPOINT,UUID /dev/nvme1n1
NAME    MOUNTPOINT UUID
nvme1n1            f508bd67-6c20-4d84-9caf-e9edd18a150b

Note: using terraform, I can guarantee that the same AWS volume will be used when the AWS instance is terminated. Since I don't run fdisk on the volume on subsequent startups, the UUID remains the same.

urso · 2019-09-05T07:50:10Z

Thank you for testing. This looks very promising.

I think we should also investigate if we can use UUID by default and have an automated upgrade path for users coming from an older registry file. I'm not really sure about the UUID in the presence of NFS or CIFS, especially if the server is Windows or a very old system.

Anyway, inode+UUID should definitely be an option.

kvch · 2019-09-05T10:48:53Z

I added it to the list of possible options. But it still needs more investigation.

@johnhoughton-v Thank you for the suggestion!

breml · 2019-09-06T12:28:03Z

I also like to chime in on this issue.
As reference, we are affected by #13314 and I have written the PR #13393 to ignore the device ID as well as an extension to it which allows auto-migration from an old registry state (see breml@6737c58).
Additionally we mentioned our problem on discuss.

Until now we do not know exactly why the device ID changes in our case, but the description @johnhoughton-v provided in the comment above sounds reasonable for our case as well (even though we are not on AWS but on the infrastructure of a local cloud provider).

I will try to find out, if the UUID generated by blkid stays the same in our infrastructure as well (but this can take some time, because the problem only hits us after maintenance windows of our cloud provider, which do not happen very often).

I have a suggestion for another approach to a device ID alternative. What about a special marker file (on *nix this could be a hidden file), which sits next to the files that are indexed by filebeat (or maybe filebeat could traverse the path up to the root to find such a file)? The content would be a unique id that is used instead of the device ID. For the situations that @johnhoughton-v and we face, this would solve the problem as well, because in our cases the block device with all its contents stays the same, but the system (maybe due to a rebuild of the instance) generates a new device ID for the same block device. Therefore, this marker file would be sufficient to identify the block device / file system as the same.

breml · 2019-09-06T12:33:55Z

For reference, related issues:

#5876 - Add support for network volumes in Filebeat
#4368 - Use file path instead of inode as identifier in the registry
#11277 - Add contents based hash to the filebeat regsitry for detecting inode reuse
#13314 - filebeat flag to disable deviceId as part of the file identification in the registry

As well as discussions on discuss:

breml · 2019-09-12T15:32:46Z

@kvch @urso I would like to know what the next steps are on this ticket and how I can help you with it. This issue and the resolution of our problem are important for us, so we are willing to help, e.g. with the implementation. Please let me know.

breml · 2019-10-14T13:30:49Z

I kindly ask again, how I can help to get some progress on this issue.

johnhoughton-v · 2020-05-22T13:53:11Z

We just had a significant production incident caused by this issue, so I decided to check back in to see if there has been any progress. Can anyone provide an update on where this stands?

breml · 2020-05-22T14:15:26Z

@johnhoughton-v For me this topic is also still relevant, but there has been no progress, no reaction, nothing for quite some time, even though I provided a PR and offered to help. So I do not really have any hope on this topic.

urso · 2020-05-25T13:23:40Z

Creating a marker file sounds interesting, but would be a little difficult to coordinate between multiple inputs with the current architecture (I think we can improve on this in the future). For now it might be easier if we ask for the marker file to exist already before we start collecting. All in all I'd prefer if Beats would not need write access to files or directories.

johnhoughton-v · 2020-05-25T14:49:26Z

Why not have the marker file be something that the user provides? Then, no write access is necessary. If, in the config, we provided the name of the file, Filebeat could read the contents of that file. We could populate that file with some unique value, like the UUID of the device. The marker file could be read once on startup of Filebeat.

urso · 2020-05-26T12:24:47Z

Why not have the marker file be something that the user provides? Then, no write access is necessary.

If, in the config, we provided the name of the file, Filebeat could read the contents of that file.

We could populate that file with some unique value, like the UUID of the device. The marker file could be read once on startup of Filebeat.

Yeah, this is what I had in mind by saying "For now it might be easier if we ask for the marker file to exist already before we start collecting".

A random UUID might be good enough. We will definitely need to document examples on how to create the 'device' file.

If the file is missing we would not collect from the directory until it is present. The thing I wonder is: do we want the file to be present in each directory, or just somewhere in the parent directories?

johnhoughton-v · 2020-05-26T14:17:05Z

I think that a marker/sentinel file would be fine from our/the user end. Here are a few thoughts:

If the file doesn't exist or can't be read, then don't use it: just log that state and use only the inode. (That's basically the use case that we're asking for anyway.)
If it exists and is readable, read it and hash the contents to get a "device indicator". From the device indicator, you can derive the internal device_id that filebeat already uses.
As a user, I would produce the file for my "/data" volume using something like this (it would contain the UUID of the device that holds /data, and then store it in the base dir of the files I'm prospecting):

blkid -o value $(df --output=source /data | tail -1) | head -1 > /data/output/filebeat_marker.dat

When I configure a filebeat input, I would specify an optional marker file for that input

- type: log
  paths:
    - /data/output/*/*log
  marker_file:
    - /data/output/filebeat_marker.dat

To answer your previous question, I think that a sentinel/marker file should apply to all files identified in a given input stanza, as suggested above.

kvch · 2020-05-26T16:02:20Z

I have started to work on this feature. The first PR is still in progress, but it can be tracked here: #18748

andresrc · 2020-07-15T08:45:01Z

can we close this issue?

kvch · 2020-07-16T13:27:51Z

Follow up in #19990

breml · 2020-07-20T08:24:11Z

@kvch Thank you for your effort. This new feature is highly appreciated and I am looking forward to testing it in our environment.

kvch · 2020-07-20T11:37:40Z

@breml Thank you for your kind words! I am looking forward to your feedback.

kvch added the discuss and Filebeat labels Sep 4, 2019

kvch self-assigned this Sep 4, 2019

kvch mentioned this issue Sep 4, 2019

Add ignore_device_id config flag to filebeat #13393

Closed

urso added the Team:Beats label Dec 27, 2019

andresrc added the Team:Integrations label and removed the Team:Beats label Mar 6, 2020

kvch mentioned this issue Jul 8, 2020

Filebeat network shares support #19736

Closed

2 tasks

kvch mentioned this issue Jul 16, 2020

Identify files based on their contents in Filebeat #19990

Closed

kvch closed this as completed Jul 16, 2020
