Edit "how file beat works", close options, and symlinks config (#2562)
* Edit how file beat works and close options

* Fix issues from review
dedemorton authored and ruflin committed Sep 19, 2016
1 parent be26fad commit 46fe81e
Showing 4 changed files with 101 additions and 82 deletions.
32 changes: 17 additions & 15 deletions filebeat/docs/faq.asciidoc
@@ -35,44 +35,46 @@ it's publishing events successfully:
[[open-file-handlers]]
== Too many open file handlers?

Filebeat keeps the file handler open in case it reaches the end of a file to read new log lines in near real time. If filebeat is harvesting a large number of files, the number of open files can be become an issue. In most environments, the number of files which are actively updated is low. The configuration `close_inactive` should be set accordingly to close files which are not active any more.
Filebeat keeps the file handler open after it reaches the end of a file so that it can read new log lines in near real time. If Filebeat is harvesting a large number of files, the number of open files can become an issue. In most environments, the number of files that are actively updated is low. The `close_inactive` configuration option should be set accordingly to close files that are no longer active.

There are 4 more configuration options which can be used to close file handlers, but all of them should be used carefully as they can side affects. The options are:
There are additional configuration options that you can use to close file handlers, but all of them should be used carefully because they can have side effects. The options are:

* close_renamed
* close_removed
* close_eof
* close_timeout
* <<close-renamed,`close_renamed`>>
* <<close-removed,`close_removed`>>
* <<close-eof,`close_eof`>>
* <<close-timeout,`close_timeout`>>
* <<harvester-limit,`harvester_limit`>>

`close_renamed` and `close_removed` can be useful on Windows and issues related to file rotation, see <<windows-file-rotation>>. `close_eof` can be useful in environments with a large number of files with only very few entries. `close_timeout` in environments where it is more important to close file handlers then to send all log lines. More details can be found in config options, see <<configuration-filebeat-options>>.
The `close_renamed` and `close_removed` options can be useful on Windows to resolve issues related to file rotation. See <<windows-file-rotation>>. The `close_eof` option can be useful in environments with a large number of files that have only very few entries. The `close_timeout` option is useful in environments where closing file handlers is more important than sending all log lines. For more details, see <<configuration-filebeat-options>>.

Before using any of these variables, make sure to study the documentation on each.
Make sure that you read the documentation for these configuration options before using any of them.
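
As a rough illustration, the close options listed above are set per prospector. The following sketch assumes a single `log` prospector; the path and all values are placeholders chosen for the example, not recommendations:

[source,yaml]
----
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Close the file handler once the file has been inactive for this long.
  close_inactive: 5m
  # Close the file as soon as the end of the file is reached
  # (many files that each receive only a few entries).
  close_eof: true
  # Close each harvester after a fixed lifetime, even if lines remain unread.
  close_timeout: 30m
  # Limit how many harvesters this prospector starts in parallel.
  harvester_limit: 100
----

In practice you would likely enable only the options that match your environment, because each one trades completeness of delivery for fewer open handlers.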

[float]
[[reduce-registry-size]]
=== Registry file is too large?

Filebeat keeps all states of the files and persists the states on disk in the `registry_file`. The states are used to continue file reading at a previous position in case filebeat is restarted. In case every day a large amount of new files is constantly produced, the registry file grows over time. To reduce the size of the registry file, there are two configuration variables: `clean_removed` and `close_inactive`.
Filebeat keeps the state of each file and persists the state to disk in the `registry_file`. The file state is used to continue file reading at a previous position when Filebeat is restarted. If a large number of new files are produced every day, the registry file might grow to be too large. To reduce the size of the registry file, there are two configuration options available: <<clean-removed,`clean_removed`>> and <<clean-inactive,`clean_inactive`>>.

For old files that are no longer updated and are ignored (see <<ignore-older,`ignore_older`>>), we recommend that you use `clean_inactive`. If old files get removed from disk, then use the `clean_removed` option.

In case old files are not touched anymore and fall under `ignore_older`, it is recommended to use `clean_inactive`. If on the other size old files get removed from disk `clean_removed` can be used.

[float]
[[inode-reuse-issue]]
=== Inode reuse causes Filebeat to skip lines?

Filebeat uses under linux inode and device to identify files. In case a file is removed from disk, the inode can again be assigned to a new file. In the case of file rotation where and old file is removed and a new one is directly created afterwards, it can happen that the new files has the exact same inode. In this case, Filebeat assumes that the new file is the same as the old and tries to continue reading at the old position which is not correct.
On Linux file systems, Filebeat uses the inode and device to identify files. When a file is removed from disk, the inode may be assigned to a new file. In use cases involving file rotation, if an old file is removed and a new one is created immediately afterwards, the new file may have the exact same inode as the file that was removed. In this case, Filebeat assumes that the new file is the same as the old and tries to continue reading at the old position, which is not correct.

By default states are never removed from the registry file. In case of inode reuse issue it is recommended to use the `clean_*` options, especially `clean_inactive`. In case your files get rotated every 24 hours and the rotated files rotated files are not updated anymore, `ignore_older` could be set to 48 hours and `clean_inactive` 72 hours.
By default, states are never removed from the registry file. To resolve the inode reuse issue, we recommend that you use the <<clean-options,`clean_*`>> options, especially <<clean-inactive,`clean_inactive`>>, to remove the state of inactive files. For example, if your files get rotated every 24 hours, and the rotated files are not updated anymore, you can set <<ignore-older,`ignore_older`>> to 48 hours and <<clean-inactive,`clean_inactive`>> to 72 hours.
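
The 24-hour rotation example could be written as the following prospector settings. This is a minimal sketch; the path is a placeholder, and the durations are the ones used in the example above:

[source,yaml]
----
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Ignore files that have not been modified in the last 48 hours.
  ignore_older: 48h
  # Remove the state of a file from the registry after 72 hours of inactivity.
  # This must be greater than ignore_older + scan_frequency.
  clean_inactive: 72h
----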

`clean_removed` can be used for files that are removed from disk. Be aware that `clean_removed` also applies if during one scan a file cannot be found anymore. In case the file shows up at a later stage again, it will be sent again from scratch.
You can use <<clean-removed,`clean_removed`>> for files that are removed from disk. Be aware that `clean_removed` cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.
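
A minimal sketch of enabling this behavior explicitly, assuming a single `log` prospector with a placeholder path:

[source,yaml]
----
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # Drop the registry state of files that can no longer be found on disk.
  # A file that disappears during one scan and reappears later is re-sent
  # from the beginning.
  clean_removed: true
----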

[float]
[[windows-file-rotation]]
=== Open file handlers cause issues with Windows file rotation?

Under Windows it can happen, that files cannot be renamed / removed as long as filebeat keeps the file handler open. This can lead to issues with the file rotating system. To reduce this issue, the options `close_removed` and `close_renamed` can be used together.
On Windows, you might have problems renaming or removing files because Filebeat keeps the file handlers open. This can lead to issues with the file rotating system. To avoid this issue, you can use the <<close-removed,`close_removed`>> and <<close-renamed,`close_renamed`>> options together.

It is important to understand, that these two options mean files are closed before the harvester finished reading the file. In case the file cannot be picked up again by the prospector and the harvester didn't finish reading the file, the missing lines will never be sent to elasticsearch.
IMPORTANT: When you configure these options, files may be closed before the harvester has finished reading the files. If the file cannot be picked up again by the prospector and the harvester hasn't finished reading the file, the missing lines will never be sent to the output.
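
For illustration, using the two options together on Windows might look like the following sketch. The path is a placeholder, and the caveat above about unread lines still applies:

[source,yaml]
----
filebeat.prospectors:
- input_type: log
  paths:
    - C:\logs\app\*.log   # hypothetical path
  # Release the file handler when the file is removed, so that the
  # rotation tool can delete it.
  close_removed: true
  # Release the file handler when the file is renamed, so that the
  # rotation tool can move it.
  close_renamed: true
----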


[float]
43 changes: 26 additions & 17 deletions filebeat/docs/how-filebeat-works.asciidoc
@@ -1,24 +1,34 @@
== How Filebeat works
[[how-filebeat-works]]
== How Filebeat Works

In this topic, you learn about the key building blocks of Filebeat and how they work together. Understanding these concepts will help you make informed decisions about configuring Filebeat for specific use cases.

Filebeat consists of two main components: <<prospector,prospectors>> and <<harvester,harvesters>>. These components work together to tail files and send event data to the output that you specify.

This page intends to explain what the key building blocks of Filebeat are and how they work together. The goal is that the available configuration options can be better understood and more informed decisions can be made about the optimal configuration options for the different use cases.

[float]
=== What is a harvester?
[[harvester]]
=== What is a Harvester?

A harvester is responsible to read the content of a single file. This is done by reading each file line by line and send the content to the output. For each file one harvester is started. The harvester is responsible to open and close files. That means, as long as a harvester is running, the file descriptor stays open. Even if a file is removed or renamed, filebeat will keep reading the file. This has the side affect that the space on your disk will be reserved until the harvester is stopped.
A harvester is responsible for reading the content of a single file. The harvester reads each file, line by line, and sends the content to the output. One harvester is started for each file. The harvester is responsible for opening and closing the file, which means that the file descriptor remains open while the harvester is running. If a file is removed or renamed while it's being harvested, Filebeat continues to read the file. This has the side effect that the space on your disk is reserved until the harvester closes. By default, Filebeat keeps the file open until <<close-inactive,`close_inactive`>> is reached.

Stopping a harvester has the following consequences:
Closing a harvester has the following consequences:

* The file handler is closed, freeing up the underlying resources if the file was deleted
* The harvesting of the file will only be started again after `scan_frequency`
* In case the file was moved / removed, harvesting the file will not continue
* The file handler is closed, freeing up the underlying resources if the file was deleted while the harvester was still reading the file.
* The harvesting of the file will only be started again after <<scan-frequency,`scan_frequency`>> has elapsed.
* If the file is moved or removed while the harvester is closed, harvesting of the file will not continue.

To define more in detail when a harvester is closed, use the <<close-options>> configuration options.
To control when a harvester is closed, use the <<close-options,`close_*`>> configuration options.
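
The interaction between closing a harvester and picking the file up again can be sketched with two settings. The values below are illustrative assumptions, and the path is a placeholder:

[source,yaml]
----
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  # How often the prospector checks the paths for new or changed files.
  scan_frequency: 10s
  # A harvester is closed after its file has been inactive for this long.
  # If the file changes afterwards, a new harvester is started on a later scan.
  close_inactive: 5m
----

With settings like these, lines written to a file after its harvester was closed are delayed by up to one scan interval rather than read in near real time.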

[float]
=== What is a prospector?
[[prospector]]
=== What is a Prospector?

A prospector is responsible for managing the harvesters and finding all sources to read from.

A prospector is reponsible to manage the harvesters. The responsibility of the prospector is to find all sources it should read from. In case of the log input type it is to find all files on the drive based on the defined paths and start a harvester for each file that was found. Each prospector runs in its own go routine and is responsible to manage the harvesters. Each prospector configuration looks as following:
If the input type is `log`, the prospector finds all files on the drive that match the defined glob paths and starts a harvester for each file. Each prospector runs in its own goroutine.

The following example configures Filebeat to harvest lines from all log files that match the specified glob patterns:

[source,yaml]
-------------------------------------------------------------------------------------
@@ -29,15 +39,14 @@ filebeat.prospectors:
- /var/path2/*.log
-------------------------------------------------------------------------------------

Filebeat currently supports two `prospector` types: `log` and `stdin`. Each prospector type can be defined multiple times. The log prospector checks for each file if a harvester has to be started, if one is already running or the file can be ignored because of configuration options which were set. New files are only picked up, if the offset / size of the file changed since the harvester was closed.
Filebeat currently supports two `prospector` types: `log` and `stdin`. Each prospector type can be defined multiple times. The `log` prospector checks each file to see whether a harvester needs to be started, whether one is already running, or whether the file can be ignored (see <<ignore-older,`ignore_older`>>). New lines are only picked up if the size of the file has changed since the harvester was closed.

[float]
=== What is a state?

Filebeat keeps the state of each file and frequently flushes the state to disk to the registry file. The state is used to remember the last offset a harvester was reading from and to ensure all log lines are sent. In case Elasticsearch/Logstash are not reachable, Filebeat will keep track of the last lines sent and will continue reading the files as soon Elasicsearch/Logstash becomes available again. As long as filebeat is running, the state information is also kept in memory by each prospector. In case of a filebeat restart, the data from the registry file is used to rebuild the state and filebeat continues each harvester at the last known position.
=== How Does Filebeat Keep the State of Files?

Each prospector keeps a state for each file it finds. As files can be renamed or moved, the filename and path are not enough to identify a file. For each file unique identifiers are stored to detect if a file the same file that was harvested previously.
Filebeat keeps the state of each file and frequently flushes the state to disk in the registry file. The state is used to remember the last offset a harvester was reading from and to ensure all log lines are sent. If the output, such as Elasticsearch or Logstash, is not reachable, Filebeat keeps track of the last lines sent and will continue reading the files as soon as the output becomes available again. While Filebeat is running, the state information is also kept in memory by each prospector. When Filebeat is restarted, data from the registry file is used to rebuild the state, and Filebeat continues each harvester at the last known position.

For more details on how to configure the state, checkout out the <<reduce-registry-size>> troubleshooting guide.
Each prospector keeps a state for each file it finds. Because files can be renamed or moved, the filename and path are not enough to identify a file. For each file, Filebeat stores unique identifiers to detect whether a file was harvested previously.

If your use case involves creating a large number of new files every day, you might find that the registry file grows to be too large. See <<reduce-registry-size>> for details about configuration options that you can set to resolve this issue.

2 changes: 2 additions & 0 deletions filebeat/docs/overview.asciidoc
@@ -10,5 +10,7 @@ Here's how Filebeat works: When you start Filebeat, it starts one or more prospe

image:./images/filebeat.png[Beats design]

For more information about prospectors and harvesters, see <<how-filebeat-works>>.

Filebeat is a https://www.elastic.co/products/beats[Beat], and it is based on the libbeat framework.
General information about libbeat and setting up Elasticsearch, Logstash, and Kibana are covered in the {libbeat}/index.html[Beats Platform Reference].