elastic · tsg · Jul 18, 2016 · Jul 5, 2016
diff --git a/filebeat/docs/how-filebeat-works.asciidoc b/filebeat/docs/how-filebeat-works.asciidoc
@@ -0,0 +1,43 @@
+== How Filebeat works
+
+This page explains the key building blocks of Filebeat and how they work together. Understanding these concepts helps you make informed decisions about the configuration options that best fit your use case.
+
+[float]
+=== What is a harvester?
+
+A harvester is responsible for reading the content of a single file. It reads the file line by line and sends the content to the output. One harvester is started for each file. The harvester is responsible for opening and closing the file, which means that the file descriptor stays open as long as the harvester is running. Even if a file is removed or renamed, Filebeat keeps reading the file. This has the side effect that the space on your disk stays reserved until the harvester is stopped.
+
+Stopping a harvester has the following consequences:
+
+* The file handle is closed, freeing up the underlying resources if the file was deleted.
+* Harvesting of the file starts again only after `scan_frequency` has elapsed.
+* If the file was moved or removed in the meantime, harvesting of the file does not continue.
+
+To control in more detail when a harvester is closed, use the <<close-options>> configuration options.
+
+[float]
+=== What is a prospector?
+
+A prospector is responsible for managing the harvesters and for finding all sources it should read from. For the `log` input type, the prospector finds all files on the drive that match the defined paths and starts a harvester for each file it finds. Each prospector runs in its own goroutine. A prospector configuration looks like the following:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    - /var/log/*.log
+    - /var/path2/*.log
+-------------------------------------------------------------------------------------
+
+Filebeat currently supports two prospector types: `log` and `stdin`. Each prospector type can be defined multiple times. For each file, the `log` prospector checks whether a harvester has to be started, whether one is already running, or whether the file can be ignored because of the configured options. A file is only picked up again if its offset or size has changed since the harvester was closed.
+
+[float]
+=== What is a state?
+
+Filebeat keeps the state of each file and frequently flushes the state to disk in the registry file. The state is used to remember the last offset a harvester was reading from and to ensure that all log lines are sent. If Elasticsearch or Logstash is not reachable, Filebeat keeps track of the last lines sent and continues reading the files as soon as Elasticsearch or Logstash becomes available again. While Filebeat is running, the state information is also kept in memory by each prospector. When Filebeat is restarted, the data from the registry file is used to rebuild the state, and Filebeat continues each harvester at the last known position.
+
+Each prospector keeps a state for each file it finds. Because files can be renamed or moved, the filename and path are not enough to identify a file. For each file, unique identifiers are stored to detect whether a file is the same file that was harvested previously.
+
+For more details on how to configure the state, check out the <<reduce-registry-size>> troubleshooting guide.
+
+
diff --git a/filebeat/docs/index.asciidoc b/filebeat/docs/index.asciidoc
@@ -17,6 +17,8 @@ include::./overview.asciidoc[]
 
 include::./getting-started.asciidoc[]
 
+include::./how-filebeat-works.asciidoc[]
+
 include::./command-line.asciidoc[]
 
 include::../../libbeat/docs/shared-directory-layout.asciidoc[]

diff --git a/filebeat/docs/reference/configuration/filebeat-options.asciidoc b/filebeat/docs/reference/configuration/filebeat-options.asciidoc
@@ -164,38 +164,115 @@ names added by Filebeat, then the custom fields overwrite the other fields.
 [[ignore-older]]
 ===== ignore_older
 
-If this option is specified, Filebeat
-ignores any files that were modified before the specified timespan. This is disabled by default.
+If this option is specified, Filebeat ignores any files that were modified before the specified timespan. This is disabled by default.
 
 You can use time strings like 2h (2 hours) and 5m (5 minutes). The default is 0, which means disable.
 Commenting out the config has the same affect as setting it to 0.
 
-Files which were falling under ignore_older and are updated again, will start
-from the offset the file was at when it was last ignored by ignore_older. As an example:
-A file was not modified for 90 hours and the offset is at 200. Now a new line is added and
-the last modification date is updated. After scan_frequency detects the change the crawling
-starts at the offset 200. In case the file was falling under ignore_older already when filebeat
-was started, the first 200 lines are never sent. In case filebeat was started earlier, the 200
-chars were already sent and it now continues at the old offset.
+There are two different cases for files that fall under `ignore_older`:
 
+* Files that were never harvested
+* Files that were harvested but weren't updated for longer than `ignore_older`
 
-===== close_older
+If a file was never harvested before and is updated, reading starts from the beginning of the file because no state has been persisted so far. For files that were harvested previously, the state still exists, so after an update reading continues at the last position.
 
-After a file was not modified for the duration of close_older, the file handle will be closed.
-After closing the file, a file change will only be detected after scan_frequency instead of almost
-instant.
+`ignore_older` relies on the modification time of the file. If the modification time is not updated when data is written to the file (this can happen on Windows), `ignore_older` starts to ignore the file even though content may have been added at a later time.
+
+`ignore_older` is especially useful if you keep log files for a long time. For example, when you start Filebeat you may want to send only the newest files and the files from the last week to Elasticsearch, rather than all files.
+
+To remove the state of previously harvested files from the registry file, use the `clean_idle` configuration option.
+
+
+Requirement: `ignore_older > close_idle`
+
+Before a file can be ignored by the prospector, it must be closed. To ensure that a file is no longer harvested when it is ignored, `ignore_older` must be set to a longer duration than `close_idle`. It can happen that a file is still being harvested although it already falls under `ignore_older`, because the harvester has not finished yet. In that case the harvester finishes reading and closes the file after `close_idle` is reached.
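+
+As a minimal sketch of this requirement, the following prospector uses illustrative values (not recommendations) in which `ignore_older` is clearly longer than `close_idle`:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    - /var/log/*.log
+  # Illustrative values: the harvester is closed after 5 minutes without
+  # new lines, and files not modified for 48 hours are ignored (48h > 5m).
+  close_idle: 5m
+  ignore_older: 48h
+-------------------------------------------------------------------------------------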
+
+[[close-options]]
+===== close_*
+
+The `close_*` configuration options are used to close the harvester after a certain criterion is met or a certain time has passed. Closing the harvester means closing the file handle. If a file is updated again after the harvester is closed, it is picked up again after <<scan-frequency>>. It is important to understand that if the file is moved away or deleted during this period, Filebeat cannot pick up the file again, and any data that the harvester had not read yet is lost.
+
+[[close-idle]]
+===== close_idle
+
+After a file has not been harvested for the duration of `close_idle`, the file handle is closed. The counter for the defined period starts when the harvester reads the last log line; it is not based on the modification time of the file. If the closed file changes again, a new harvester is started, at the latest after `scan_frequency` has elapsed.
+
+It is recommended to set `close_idle` to a value that is larger than the interval between the least frequent updates to your log file. If your log file is updated every few seconds, you can safely set it to `1m`. If you have log files with very different update rates, you can use multiple prospector configurations with different values.
+
+Setting `close_idle` to a lower value means that file handles are closed sooner, but it has the side effect that new log lines are not sent in near real time once the harvester has been closed.
+
+The timestamp for closing a file does not depend on the modification time of the file but on an internal timestamp that is updated when the file was last harvested. If `close_idle` is set to 5 minutes, the countdown for the 5 minutes starts the last time the harvester read a line from the file.
 
 You can use time strings like 2h (2 hours) and 5m (5 minutes). The default is 1h.
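+
+As a sketch of using multiple prospector configurations for files with different update rates, the following example assumes that the files under the first path are written every few seconds and those under the second path only rarely. The paths and durations are illustrative only:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+# Assumed to be updated every few seconds: close idle handles quickly.
+- input_type: log
+  paths:
+    - /var/log/*.log
+  close_idle: 1m
+# Assumed to be updated rarely: keep handles open longer.
+- input_type: log
+  paths:
+    - /var/path2/*.log
+  close_idle: 2h
+-------------------------------------------------------------------------------------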
 
 
+===== close_renamed
+
+WARNING: Only use this option if you understand that data loss is a potential side effect.
+
+This option closes the file handle when a file is renamed. This happens, for example, when files are rotated. By default, the harvester stays open and keeps reading the file because the file handle does not depend on the file name. If this option is enabled and the file is renamed or moved in such a way that it no longer matches the prospector patterns, the file is not picked up again and Filebeat does not finish reading it.
+
+WINDOWS: If your Windows log rotation system shows errors because it can't rotate the files, you should enable this option.
+
+===== close_removed
+
+WARNING: Only use this option if you understand that data loss is a potential side effect.
+
+`close_removed` closes a harvester as soon as its file is removed. Normally a file should only be removed after it already falls under `close_idle`. If files are removed earlier and this option is not set, Filebeat keeps the file open to make sure the harvester finishes reading it. Use this option if the file handle should be released immediately after removal.
+
+
+WINDOWS: If your Windows log rotation system shows errors because it can't rotate the files, you should enable this option.
+
+
+===== close_eof
+
+WARNING: Only use this option if you understand that data loss is a potential side effect.
+
+`close_eof` closes a file as soon as the end of the file is reached. This is useful if your files are written only once and not updated from time to time, for example when every single log event is written to a new file.
+
+===== close_timeout
+
+WARNING: Only use this option if you understand that data loss is a potential side effect.
+
+`close_timeout` gives every harvester a predefined lifetime. Regardless of the reader's position in the file, reading stops after `close_timeout` has elapsed. This option can be useful if only a predefined amount of time should be spent on older log files. Setting `close_timeout` equal to `ignore_older` means that the file is not picked up again if it wasn't modified in the meantime. This normally leads to data loss because the complete file is not sent.
+
+[[clean-options]]
+===== clean_*
+
+The `clean_*` options are used to clean up the state entries. This helps to reduce the size of the registry file and can prevent a potential <<inode-reuse-issue>>. These options are disabled by default because wrong settings can lead to data duplication when complete log files are sent again.
+
+===== clean_idle
+
+WARNING: Only use this option if you understand that data loss is a potential side effect.
+
+`clean_idle` removes the state of a file after the given period. The state can only be removed if the file is already ignored by Filebeat, which means that it falls under `ignore_older`. The requirement is `clean_idle > ignore_older + scan_frequency`, which makes sure that no state is removed while a file is still being harvested. Otherwise, Filebeat could resend the full content constantly because `clean_idle` would remove the state of files that are still detected by the prospector. If a file is updated or appears again after its state was removed, the file is read from the beginning.
+
+The `clean_idle` configuration option is useful to reduce the size of the registry file, especially if a large number of new files is generated every day.
+
+In addition, this option is useful to prevent the <<inode-reuse-issue>>. If a file is deleted, its inode can be reused by a newly created file. If the inode is the same, Filebeat assumes that it knows the file and continues at the old position. Because this issue becomes more likely over time, it is good to clean up old states to make sure Filebeat does not assume that it already knows the file.
+
+NOTE: Every time a file is renamed, the file state will be updated and the counter for `clean_idle` will start at 0 again.
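+
+The requirement `clean_idle > ignore_older + scan_frequency` can be sketched as follows. The durations are illustrative only, chosen so that 72h is larger than 48h plus 10s:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    - /var/log/*.log
+  scan_frequency: 10s
+  # Files not modified for 48 hours are ignored by the prospector.
+  ignore_older: 48h
+  # State is removed only after the file has been ignored for a while:
+  # 72h > 48h + 10s.
+  clean_idle: 72h
+-------------------------------------------------------------------------------------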
+
+===== clean_removed
+
+WARNING: Only use this option if you understand the potential side effects.
+
+
+`clean_removed` removes the state of files that can no longer be found on disk. This does not apply to renamed files or to files that were moved to another directory that is still visible to Filebeat. Be aware that this option removes the state of removed files immediately. If a shared drive disappears for a short period and appears again, all files are read again from the beginning. This option can only be used in combination with `close_removed`.
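+
+A minimal sketch of combining the two options, assuming that both are simple on/off (boolean) settings, which this section does not state explicitly:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    - /var/log/*.log
+  # Assumption: both options take boolean values.
+  # Close the harvester as soon as the file is removed ...
+  close_removed: true
+  # ... and drop the state of files that are no longer found on disk.
+  clean_removed: true
+-------------------------------------------------------------------------------------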
+
+
+[[scan-frequency]]
 ===== scan_frequency
 
 How often the prospector checks for new files in the paths that are specified
 for harvesting. For example, if you specify a glob like `/var/log/*`, the
 directory is scanned for files using the frequency specified by
 `scan_frequency`. Specify 1s to scan the directory as frequently as possible
-without causing Filebeat to scan too frequently. The default setting is
-10s.
+without causing Filebeat to scan too frequently. We do not recommend setting this value to less than 1s.
+
+If you require log lines to be sent in near real time, do not use a very low `scan_frequency`. Instead, adjust `close_idle` so that the file handle stays open and the harvester keeps polling your files.
+
+The default setting is 10s.
 
 ===== document_type
 
@@ -322,6 +399,8 @@ will never exceed `max_backoff` regardless of what is specified for  `backoff_fa
 Because it takes a maximum of 10s to read a new line, specifying 10s for `max_backoff` means that, at the worst, a new line could be added to the log file if Filebeat has
 backed off multiple times. The default is 10s.
 
+Requirement: `max_backoff <= scan_frequency`. If `max_backoff` needs to be larger, it is recommended to close the file handle instead and let the prospector pick up the file again.
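+
+As a sketch, the following prospector satisfies this requirement. It uses the default values documented for both options, written out explicitly for illustration:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    - /var/log/*.log
+  # max_backoff (10s) does not exceed scan_frequency (10s).
+  scan_frequency: 10s
+  max_backoff: 10s
+-------------------------------------------------------------------------------------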
+
 ===== backoff_factor
 
 This option specifies how fast the waiting time is increased. The bigger the
@@ -331,15 +410,6 @@ the backoff algorithm is disabled, and the `backoff` value is used for waiting f
 lines. The `backoff` value will be multiplied each time with the `backoff_factor` until
 `max_backoff` is reached. The default is 2.
 
-===== force_close_files
-
-By default, Filebeat keeps the files that it’s reading open until the timespan specified by `close_older` has elapsed.
-This behaviour can cause issues when a file is removed. On Windows, the file cannot be fully removed until Filebeat closes
-the file. In addition no new file with the same name can be created during this time.
-
-You can force Filebeat to close the file as soon as the file name changes by setting the
-`force_close_files` option to true. The default is false. Turning on this option can lead to loss of data on
-rotated files in case not all lines were read from the rotated file.
 
 [[configuration-global-options]]
 === Filebeat Global Configuration

diff --git a/filebeat/docs/troubleshooting.asciidoc b/filebeat/docs/troubleshooting.asciidoc
@@ -12,6 +12,51 @@ following tips.
 
 include::../../libbeat/docs/getting-help.asciidoc[]
 
+
+== Reduce open file handlers
+
+Filebeat keeps the file handle open when it reaches the end of a file so that it can read new log lines in near real time. If Filebeat is harvesting a large number of files, the number of open files can become an issue. In most environments, the number of files that are actively updated is low. Set `close_idle` accordingly to close files that are no longer active.
+
+There are four more configuration options that can be used to close file handles, but all of them should be used carefully because they can have side effects. The options are:
+
+* close_renamed
+* close_removed
+* close_eof
+* close_timeout
+
+`close_renamed` and `close_removed` can be useful on Windows to resolve issues related to file rotation, see <<windows-file-rotation>>. `close_eof` can be useful in environments with a large number of files that contain only very few entries. `close_timeout` can be useful in environments where it is more important to close file handles than to send all log lines. For more details, see <<configuration-filebeat-options>>.
+
+Before using any of these options, make sure to study the documentation for each one.
+
+
+[[reduce-registry-size]]
+== Reduce Registry File Size
+
+Filebeat keeps the state of each file and persists the states to disk in the `registry_file`. The states are used to continue reading files at the previous position when Filebeat is restarted. If a large number of new files is produced every day, the registry file grows over time. To reduce the size of the registry file, there are two configuration options: `clean_removed` and `clean_idle`.
+
+If old files are not touched anymore and fall under `ignore_older`, it is recommended to use `clean_idle`. If, on the other hand, old files are removed from disk, `clean_removed` can be used.
+
+[[inode-reuse-issue]]
+== Inode Reuse Issue
+
+On Linux, Filebeat uses the inode and device to identify files. When a file is removed from disk, its inode can be assigned to a new file. With file rotation, where an old file is removed and a new one is created directly afterwards, the new file can end up with the exact same inode. In this case, Filebeat assumes that the new file is the same as the old one and tries to continue reading at the old position, which is not correct.
+
+By default, states are never removed from the registry file. To address the inode reuse issue, it is recommended to use the `clean_*` options, especially `clean_idle`. If your files are rotated every 24 hours and the rotated files are not updated anymore, `ignore_older` could be set to 48 hours and `clean_idle` to 72 hours.
+
+`clean_removed` can be used for files that are removed from disk. Be aware that `clean_removed` also applies if a file cannot be found during a single scan. If the file shows up again later, it is sent again from scratch.
+
+[[windows-file-rotation]]
+== Windows File Rotation
+
+On Windows, files might not be renamed or removed as long as Filebeat keeps the file handle open. This can lead to issues with the file rotation system. To mitigate this issue, the options `close_removed` and `close_renamed` can be used together.
+
+It is important to understand that with these two options, files can be closed before the harvester has finished reading them. If the file cannot be picked up again by the prospector and the harvester did not finish reading it, the missing lines are never sent to Elasticsearch.
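+
+A minimal sketch of such a configuration follows. The path is hypothetical, and the example assumes that both options are on/off (boolean) settings:
+
+[source,yaml]
+-------------------------------------------------------------------------------------
+filebeat.prospectors:
+- input_type: log
+  paths:
+    # Hypothetical Windows log location, for illustration only.
+    - C:\logs\*.log
+  # Assumption: both options take boolean values.
+  # Release the handle when the rotation system renames or removes the file.
+  close_renamed: true
+  close_removed: true
+-------------------------------------------------------------------------------------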
+
+
+
+
+
+
 [[enable-filebeat-debugging]]
 == Debugging