Edit "how file beat works", close options, and symlinks config #2562

Merged
merged 2 commits into from
Sep 19, 2016

dedemorton
Contributor

No description provided.

@dedemorton
Contributor Author

dedemorton commented Sep 15, 2016

This adds edits for work tracked by #2482

@monicasarbu
Contributor

@ruflin can you please have a look?

Member

@ruflin ruflin left a comment

Thanks for the review. I left some minor comments. There are 2-3 things that changed since I wrote the docs.


[float]
[[reduce-registry-size]]
=== Registry file is too large?

Filebeat keeps all states of the files and persists the states on disk in the `registry_file`. The states are used to continue file reading at a previous position in case filebeat is restarted. In case every day a large amount of new files is constantly produced, the registry file grows over time. To reduce the size of the registry file, there are two configuration variables: `clean_removed` and `close_inactive`.
Filebeat keeps the state of each file and persists the state to disk in the `registry_file`. The file state is used to continue file reading at a previous position when Filebeat is restarted. If a large number of new files are produced every day, the registry file might grow to be too large. To reduce the size of the registry file, there are two configuration options available: <<clean-removed,`clean_removed`>> and <<close-inactive,`close_inactive`>>.
Member

The mistake was already in the previous entry. It should be clean_inactive and not close_inactive :-(
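
For illustration only (this is not part of the PR diff), a prospector that uses the two registry clean-up options named in this comment might look like the sketch below. The path and the durations are made-up values. The sketch also assumes that `clean_inactive` has to be longer than `ignore_older` plus the scan frequency, so check the option reference before copying it.

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Assumed values: ignore files not modified in the last 48 hours.
  ignore_older: 48h
  # Remove registry state for files that have been inactive for 72 hours.
  # Assumption: this must be greater than ignore_older + scan_frequency.
  clean_inactive: 72h
  # Remove registry state for files that no longer exist on disk.
  clean_removed: true
```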

* <<close-renamed,`close_renamed`>>
* <<close-removed,`close_removed`>>
* <<close-eof,`close_eof`>>
* <<close-timeout,`close_timeout`>>
Member

Now that we added the config option harvester_limit it would be nice to also list it here.

Member

@dedemorton We should address this in another PR.
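
As a rough sketch of how the options in this thread sit together in a prospector (not text from the PR), the `close_*` options listed above and `harvester_limit` could be set as follows. The path and every value here are illustrative assumptions, not recommended settings or defaults.

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Close the file handle when the file is renamed or removed.
  close_renamed: true
  close_removed: true
  # Close the file as soon as the end of the file is reached.
  close_eof: false
  # Close the harvester after a fixed lifetime, whatever the read position.
  close_timeout: 5m
  # Cap the number of harvesters started in parallel for this prospector.
  harvester_limit: 100
```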


A harvester is responsible to read the content of a single file. This is done by reading each file line by line and send the content to the output. For each file one harvester is started. The harvester is responsible to open and close files. That means, as long as a harvester is running, the file descriptor stays open. Even if a file is removed or renamed, filebeat will keep reading the file. This has the side affect that the space on your disk will be reserved until the harvester is stopped.
A harvester is responsible for reading the content of a single file. The harvester reads each file, line by line, and sends the content to the output. One harvester is started for each file. The harvester is responsible for opening and closing the file, which means that the file descriptor remains open while the harvester is running. If a file is removed or renamed while it's being harvested, Filebeat continues to read the file. This has the side effect that the space on your disk is reserved until the harvester closes. By default, Filebeat keeps the file open for harvesting for 5 minutes.
Member

Last sentence: "By default, Filebeat keeps the file open until close_inactive is reached to send new events in near real time."

We should not mention 5 minutes here, so that the text stays generic in case we change the default.

Contributor Author

@ruflin I'll mention the close_inactive option instead of a value. However, we probably shouldn't say "in near real time" here because that won't be true if the user changes the close_inactive setting to a much lower number, right?

@@ -29,15 +39,14 @@ filebeat.prospectors:
- /var/path2/*.log
-------------------------------------------------------------------------------------

Filebeat currently supports two `prospector` types: `log` and `stdin`. Each prospector type can be defined multiple times. The log prospector checks for each file if a harvester has to be started, if one is already running or the file can be ignored because of configuration options which were set. New files are only picked up, if the offset / size of the file changed since the harvester was closed.
Filebeat currently supports two `prospector` types: `log` and `stdin`. Each prospector type can be defined multiple times. The `log` prospector checks each file to see whether a harvester needs to be started, whether one is already running, or whether the file can be ignored (see <<ignore-older,`ignore_older`>>). New files are only picked up if the offset or size of the file has changed since the harvester was closed.
Member

We should only use "size" here instead of "offset or size". I meant offset / size to be the same thing, but reading it again I see that this is confusing. I think size is the more common term.

Requirement: ignore_older > close_inactive

Before a file can be ignored by the prospector, it must be closed. To ensure a file is not harvested anymore when it is ignored, ignore_older must be set to a longer duration then `close_inactive`. It can happen, that a file is still harvested but already falls under `ignore_older` as the harvester didn't finish yet. The harvester will finish reading and close it after `close_inactive` is reached.
If a file that's currently being harvested falls under `ignore_older`, the harvester will finish reading the file and close it after `close_inactive` is reached.
Member

"the harvester will first finish reading ... is reached. Only after that the file will be ignored."



Requirement: ignore_older > close_inactive
Member

You removed this line. Any idea how we could keep this in somehow (perhaps in a nicer way?)

Contributor Author

It seemed redundant to me because you repeated it in the text. I thought maybe it was a note that you forgot to remove. Also, the angle bracket is a bit ambiguous. I realize it means "greater than", but users could potentially misinterpret this (for example, someone might think it means "set this to"). Gotta account for all the ways that people can misinterpret things. :-) When we want to call something out as important, we use "important" notes. I'll flag this one.
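
To make the requirement from this thread concrete, here is a minimal sketch (not part of the PR) of a prospector where `ignore_older` is set to a longer duration than `close_inactive`. The path and both durations are assumed values chosen only to show the ordering.

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Close the file handle after the file has not been updated for 5 minutes.
  close_inactive: 5m
  # Ignore files older than 24 hours. This is deliberately longer than
  # close_inactive, so a file is closed before it can be ignored.
  ignore_older: 24h
```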



[[clean-options]]
===== clean_*

The `clean_*` variables are used to clean up the state entries. This helps to reduce the size of the registry file and can prevent a potential <<inode-reuse-issue>>. These options are disabled by default as wrong settings can lead to data duplication as complete log files are sent again.
The `clean_*` options are used to clean up the state entries in the registry file. These settings help to reduce the size of the registry file and can prevent a potential <<inode-reuse-issue,inode reuse issue>>. These options are disabled by default because incorrect settings can lead to data duplication when complete log files are sent again.
Member

"These options are disabled" is not correct anymore, as "clean_removed" is enabled by default.

Contributor Author

Good catch. I'll remove the line entirely because we warn the user whenever specific options might be problematic.

@ruflin ruflin merged commit 46fe81e into elastic:master Sep 19, 2016
3 participants