Known Issue: Prospector reloading unfinished files #3546
There are two options for stopping a harvester or a prospector: either the harvester and prospector finish sending all events and stop themselves, or they are killed because the output is blocking. When shutting down Filebeat without `shutdown_timeout`, Filebeat is expected to shut down as fast as possible. This means channels are closed directly and the events are not passed through to the registry. With dynamic prospector reloading, prospectors and harvesters must be stopped properly, as otherwise no new harvester can be started for the same file. To make this possible, the following changes were made:

* Introduce harvester tracking in the prospector to better control and manage the harvesters. The implementation is based on a harvester registry which starts and stops the harvesters.
* Use an outlet to send events from the harvester to the prospector. This outlet has an additional signal, giving two options for when the outlet should be finished: it can be stopped by the harvester itself or globally through closing beatDone.
* Introduce more done channels in the prospector to make shutdown more fine grained.
* Add system tests to verify the new behaviour.

Closes elastic#3546
Filebeat 6.2.3
filebeat config:
Errors are only generated for files under /var/log/nginx/*.log. Why does it only fail for the nginx logs? Note: I've installed Filebeat on all client servers and found no errors, but when Filebeat is installed on the same ELK server, I get the errors mentioned.
Prospector reloading was introduced in 5.3. This issue describes a known problem with the implementation.
This bug affects all reloading that reloads a prospector with a file that was harvested before. If a prospector is started with new files, this should not have any effect. In general, the recommendation is not to use reloading to update settings like `fields` or `paths` in a prospector, but to add new prospectors with new paths and to remove old ones.

**Example: Working**
subconfig.yml before
subconfig.yml after
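The original config snippets are missing from this page. A minimal illustrative sketch of what the surrounding text describes could look like the following (the `/var/log/` prefix and `input_type` syntax are assumptions based on the Filebeat 5.x era of this issue):

```yaml
# subconfig.yml before (illustrative)
- input_type: log
  paths:
    - /var/log/test.log

# subconfig.yml after (illustrative): a second prospector is added;
# the existing prospector for test.log is left untouched
- input_type: log
  paths:
    - /var/log/test.log
- input_type: log
  paths:
    - /var/log/newfile.log
```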
This works because only a new prospector for `newfile.log` has to be started, and the older prospector keeps running.

**Example: NOT Working**
subconfig.yml before
subconfig.yml after
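Again, the original config snippets are missing here. An illustrative sketch of the failing case described below (path prefix and the changed `fields` setting are assumptions):

```yaml
# subconfig.yml before (illustrative)
- input_type: log
  paths:
    - /var/log/test.log

# subconfig.yml after (illustrative): same file, but a setting changed,
# so the prospector for test.log must be stopped and a new one started
- input_type: log
  paths:
    - /var/log/test.log
  fields:
    env: staging
```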
The above does not work because the prospector with `test.log` has to be stopped and a new prospector with `test.log` has to be started.

**Technical Details**
On shutdown, Filebeat tries to shut down as fast as possible. This means it does not wait to finish sending all events and persisting all states to disk, as it is unknown how long that would take. This can become an issue with reloading.
When a new prospector configuration is loaded, Filebeat ensures that all states loaded by the new prospector are set to finished. This verifies that no file is harvested by two different prospectors at the same time, as that could lead to duplicated events and unexpected behaviour.
In Filebeat, `harvester/log.go` contains the code used to send events and state. This code is called one last time before stopping a harvester, with `state.Finished: true`, to set the state to finished so that a new harvester can pick the file up. The problem is that this final state may never be sent, because the select statement can take the `h.done` branch instead. On a normal shutdown this is not an issue, since whether a state is finished is not persisted to disk. For prospector reloading it matters, because the in-memory states carry the finished flag. So if `h.done` is selected instead of `h.prospectorChan`, a harvester state is never marked finished.

The reason `h.done` is required here is that when Filebeat is stopped while the output is blocking, it must still shut down directly. This implies that a prospector and a harvester need two different stop methods: one that waits for sending to complete, used for reloading, and one that shuts down immediately.

Some experiments happen in #3538