
Known Issue: Prospector reloading unfinished files #3546

Closed
ruflin opened this issue Feb 7, 2017 · 1 comment · Fixed by #3563
ruflin commented Feb 7, 2017

Prospector reloading was introduced in 5.3. This issue describes a known issue with the implementation.

This bug affects any reload that restarts a prospector for a file that was harvested before. Starting a prospector with only new files is not affected. In general, the recommendation is not to use reloading to update settings like fields or paths in an existing prospector, but to add new prospectors with new paths and remove old ones.

Working Example

subconfig.yml before

- input_type: log
  paths:
    - /var/log/test.log
  scan_frequency: 1s

subconfig.yml after

- input_type: log
  paths:
    - /var/log/test.log
  scan_frequency: 1s
- input_type: log
  paths:
    - /var/log/newfile.log
  scan_frequency: 1s

This works because only a new prospector for newfile.log has to be started; the existing prospector for test.log keeps running.

Non-Working Example

subconfig.yml before

- input_type: log
  paths:
    - /var/log/test.log
  scan_frequency: 1s

subconfig.yml after

- input_type: log
  paths:
    - /var/log/test.log
    - /var/log/newfile.log
  scan_frequency: 1s

The above does not work because the existing prospector for test.log has to be stopped and a new prospector covering both files has to be started.

Technical Details

On shutdown, filebeat tries to stop as fast as possible. It does not wait for all events to be sent and all states to be persisted to disk, as it is unknown how long that would take. This becomes an issue with reloading.

When a new prospector configuration is loaded, filebeat verifies that all states picked up by the new prospector are set to finished. This guarantees that no file is harvested by two different prospectors at the same time, which could lead to duplicated events and unexpected behaviour. A minimal sketch of this check follows.
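
A minimal sketch of that check; the State type and canStartProspector are illustrative stand-ins, not Filebeat's actual API:

package sketch

import "fmt"

// State is a simplified version of a registry entry for one file.
type State struct {
	Source   string // path of the harvested file
	Finished bool   // true once the previous harvester fully stopped
}

// canStartProspector refuses to start a prospector if any state it
// would pick up is still unfinished, since starting anyway could lead
// to two prospectors harvesting the same file.
func canStartProspector(states []State) error {
	for _, s := range states {
		if !s.Finished {
			return fmt.Errorf("can only start a prospector when all related states are finished: %+v", s)
		}
	}
	return nil
}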

In Filebeat, harvester/log.go contains the following code, which is used to send events and state updates:

func (h *Harvester) sendEvent(event *input.Event) bool {
	select {
	case <-h.done:
		return false
	case h.prospectorChan <- event: // ship the new event downstream
		return true
	}
}

This code is called one last time before a harvester stops, with state.Finished: true, to mark the state as finished so that a new harvester can pick the file up. The problem is that this final state may never be sent, because the select statement can take the h.done branch instead. On a normal shutdown this is not an issue, since whether a state is Finished is not persisted to disk anyway. For prospector reloading it matters, because the in-memory states carry the Finished flag. So if h.done is selected instead of h.prospectorChan, the harvester's state is never marked as finished.

The reason h.done is required here is that when filebeat is stopped while the output is blocking, it must still shut down immediately. This implies that a prospector and harvester need two different stop methods: one that waits for sending to complete, used for reloading, and one that shuts down immediately. A sketch of this idea follows.
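
A minimal sketch of such a two-signal outlet, assuming a harvester-local done channel and a global beatDone channel (illustrative names, not Filebeat's actual implementation):

package sketch

// Event stands in for the events sent from harvester to prospector.
type Event struct {
	Finished bool // the final state update carries Finished: true
}

// Outlet wraps the channel between harvester and prospector.
type Outlet struct {
	channel  chan *Event
	done     chan struct{} // harvester-local: closed for an immediate stop
	beatDone chan struct{} // global: closed when the whole beat shuts down
}

// OnEvent blocks until the event is consumed or the beat shuts down.
// This is the reload path: closing the harvester-local done channel
// does not abort it, so the final Finished state can still get through.
func (o *Outlet) OnEvent(event *Event) bool {
	select {
	case <-o.beatDone:
		return false
	case o.channel <- event:
		return true
	}
}

// OnEventSignal additionally gives up when the harvester-local done
// channel is closed. This is the fast shutdown path.
func (o *Outlet) OnEventSignal(event *Event) bool {
	select {
	case <-o.done:
		return false
	case <-o.beatDone:
		return false
	case o.channel <- event:
		return true
	}
}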

Some experiments are happening in #3538.

ruflin added the bug and Filebeat labels Feb 7, 2017
ruflin added a commit to ruflin/beats that referenced this issue Feb 7, 2017
tsg pushed a commit that referenced this issue Feb 7, 2017
ruflin added a commit to ruflin/beats that referenced this issue Feb 9, 2017
Fix harvester shutdown for prospector reloading

There are two options for stopping a harvester or a prospector: either they finish sending all events and stop themselves, or they are killed because the output is blocking.

When filebeat shuts down without `shutdown_timeout`, it is expected to stop as fast as possible. This means channels are closed directly and pending events are not passed through to the registry.

For dynamic prospector reloading, prospectors and harvesters must be stopped properly, as otherwise no new harvester for the same file can be started. To make this possible, the following changes were made:

* Introduce harvester tracking in the prospector to better control and manage the harvesters. The implementation is based on a harvester registry which starts and stops the harvesters (see the sketch below).
* Use an outlet to send events from harvester to prospector. The outlet has an additional signal so there are two options for when it should finish: it can be stopped by the harvester itself or globally by closing beatDone.
* Introduce more done channels in the prospector to make shutdown more fine-grained.
* Add system tests to verify the new behaviour.

Closes elastic#3546
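
A minimal sketch of such a harvester registry, with an illustrative Harvester interface rather than Filebeat's actual types:

package sketch

import "sync"

// Harvester is whatever reads one file and can be asked to stop.
type Harvester interface {
	Run()
	Stop()
}

// Registry tracks the running harvesters of one prospector so they can
// be stopped and waited for before a reload starts replacements.
type Registry struct {
	mu         sync.Mutex
	wg         sync.WaitGroup
	harvesters []Harvester
}

// Start launches a harvester in its own goroutine and tracks it.
func (r *Registry) Start(h Harvester) {
	r.mu.Lock()
	r.harvesters = append(r.harvesters, h)
	r.mu.Unlock()

	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		h.Run()
	}()
}

// Stop signals all harvesters and blocks until each Run has returned,
// so their final Finished states have been processed before new
// harvesters for the same files are started.
func (r *Registry) Stop() {
	r.mu.Lock()
	for _, h := range r.harvesters {
		h.Stop()
	}
	r.mu.Unlock()
	r.wg.Wait()
}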
urso pushed a commit that referenced this issue Feb 13, 2017 (same commit message as above, with review changes added)
ruflin added a commit to ruflin/beats that referenced this issue Feb 14, 2017 (same commit message; cherry picked from commit 15b32e4)
tsg pushed a commit that referenced this issue Feb 14, 2017 (same commit message; cherry picked from commit 15b32e4)
@nacesprin

Filebeat 6.2.3
Error in /var/log/filebeat/filebeat:

Unable to create runner due to error: Can only start a prospector when all related states are finished: {Id: Finished:false Fileinfo:0xc420196b60 Source:/var/log/nginx/access.log Offset:169527 Timestamp:2018-04-25 12:22:05.087086821 +0200 CEST m=+932.461573499 TTL:-1ns Type:log FileStateOS:39185-2048}

filebeat config:

- type: log

  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/*.log
    - /var/log/nginx/*.log
    - /var/log/elasticsearch/*.log
  tags: ["sistema"]

Errors are only generated for files under /var/log/nginx/*.log.
If I remove /var/log/nginx/error.log, I then get the error for /var/log/nginx/access.log, and so on.

Why does it fail only for the nginx logs?

Note: I've installed filebeat on all client servers and found no errors, but when filebeat is installed on the same server as the ELK stack, I get the errors mentioned above.
