Prevent prospector scans from racing and leaking harvesters #2539
Conversation
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.
@viveklak Thanks for taking this on. Can you share some more thoughts on your solution and on the problem you fixed?
@@ -179,6 +183,7 @@ func (p *ProspectorLog) scan() {
 	// Decides if previous state exists
 	if lastState.IsEmpty() {
 		logp.Debug("prospector", "Start harvester for new file: %s", newState.Source)
+		p.Prospector.states.Update(newState)
New states should not be persisted before the harvester is started, because during harvester startup some checks are made, and only after those pass is the state sent to the prospector.
One thing I could see happening is that, because all harvesters are started in a goroutine and only persist their state afterwards, a scan can run again and start another harvester for the same file if persisting all the new states takes too long. This would require a rather low scan_frequency. There are some system tests that try to emulate this, and so far this "race" has not shown up. The reason your PR above would prevent this case is not the mutex but the state update. Unfortunately that brings some other problems with it. It would be really great if we could find a way to reproduce this more or less reliably. I will try to write a test case with a very low scan_frequency and a very large number of files already present on startup.
I think I can also see how this relates to a blocking Kafka output in https://discuss.elastic.co/t/filebeat-5-0-0-alpha5-multiple-harvesters-for-same-file-sending-the-same-messages-over-and-over/59866. Because the channel through which the state updates are sent is blocked by the output, the new states are not sent from the harvesters that were started. Since no states arrive but scans continue, the prospector starts the same harvester for the file again and again, as the state was never processed.
@ruflin You understood correctly. The core of the problem is that the registration of the harvester with the prospector happens asynchronously (the harvester itself is started in a goroutine and sends its state on a shared channel to the prospector). If there is a significant enough delay in the harvester returning its initial state to the prospector, the prospector may start another harvester for the same source. The delay could be due to a large number of existing harvesters competing on the same channel to send state back, a blocking output, etc. Registering the initial state within the prospector seems to have helped in my case, but as you said, it might cause other issues.
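To make the timing concrete, here is a minimal, hypothetical Go sketch of that race. The `prospector`, `state`, `runHarvester`, and `hasState` names are simplified stand-ins, not the real filebeat types: because the harvester only registers its first state asynchronously over a shared channel, a second scan that runs before that update is consumed starts a duplicate harvester for the same file.

```go
// Hypothetical, simplified illustration of the race described above.
package main

import (
	"fmt"
	"sync"
	"time"
)

type state struct{ source string }

type prospector struct {
	mu      sync.Mutex
	states  map[string]state
	updates chan state // shared channel; a blocked output slows this down
}

func (p *prospector) hasState(source string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.states[source]
	return ok
}

// scan starts a harvester for every file it has no state for yet.
func (p *prospector) scan(files []string) {
	for _, f := range files {
		if !p.hasState(f) {
			go p.runHarvester(f) // the state only arrives later, asynchronously
		}
	}
}

func (p *prospector) runHarvester(source string) {
	// ... open the file, run startup checks ...
	p.updates <- state{source: source} // blocks while the output is blocked
}

// processUpdates is the prospector side that persists incoming states.
func (p *prospector) processUpdates() {
	for s := range p.updates {
		p.mu.Lock()
		p.states[s.source] = s
		p.mu.Unlock()
		fmt.Println("registered state for", s.source)
	}
}

func main() {
	p := &prospector{states: map[string]state{}, updates: make(chan state)}
	files := []string{"/var/log/app.log"}

	// Simulate a blocked output: nobody reads p.updates yet, so the first
	// harvester cannot register its state before the second scan runs.
	p.scan(files)
	p.scan(files) // starts a duplicate harvester for the same file

	go p.processUpdates()
	time.Sleep(100 * time.Millisecond)
}
```

Running this prints "registered state for /var/log/app.log" twice, i.e. two harvesters for one file; a blocked output or many competing harvesters just widens that window.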
@viveklak I opened a PR with a potential fix: #2541. The main downside / upside now is that scans can be blocked if the output is blocked. On the bright side, this prevents further harvesters from being started and kept open. On the less bright side, it means that when the output becomes available again, old files will finish sending while new files may have been rotated away in the meantime. It would be great if you could test my PR to see if it helps in your case.
In case newly started harvesters did not persist their first state before the next scan started, multiple harvesters could have been started for the same file. This could have been caused by a large number of files or by the output blocking. The problem is solved by making the Setup step of the Harvester synchronous and blocking the scan. Part of this is also updating the first state of the harvester as part of the prospector. The side effect of this change is that a scan now blocks if the channel is blocked, which means the output is probably not responding. If the output is not responding, scans will not continue and new files will not be discovered until the output is available again. The code can be further simplified in the future by merging create/startHarvester. This will be done in a second step to keep the backport commit to a minimum. See also elastic#2539
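For illustration only, here is a hedged sketch of the shape of that fix, again using simplified stand-in types rather than the actual filebeat Prospector/Harvester code: the first state is persisted synchronously inside the scan, so the next scan already sees it, and only the long-running harvesting work stays in a goroutine.

```go
// Hypothetical, simplified illustration of the synchronous-setup approach.
package main

import "sync"

type state struct{ source string }

type prospector struct {
	mu      sync.Mutex
	states  map[string]state
	updates chan state // later state updates still flow through here
}

func (p *prospector) hasState(source string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.states[source]
	return ok
}

// scan performs the harvester setup synchronously before launching it.
func (p *prospector) scan(files []string) {
	for _, f := range files {
		if p.hasState(f) {
			continue
		}

		// Synchronous setup: persist the first state before the scan returns.
		// If this setup stalls because the output is blocked, the scan stalls
		// too (the trade-off described above), but no second harvester is
		// ever started for the same file.
		st := state{source: f}
		p.mu.Lock()
		p.states[f] = st
		p.mu.Unlock()

		go p.harvest(st) // only reading and publishing runs asynchronously
	}
}

func (p *prospector) harvest(st state) {
	// ... open the file, read lines, publish events, and send follow-up
	// state updates on p.updates ...
}

func main() {
	p := &prospector{states: map[string]state{}, updates: make(chan state)}
	files := []string{"/var/log/app.log"}
	p.scan(files)
	p.scan(files) // the second scan sees the state and starts nothing new
}
```

The trade-off from the commit message is visible here: if the setup blocks on the output, the scan blocks too and new files are not discovered, but duplicate harvesters can no longer leak.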
@ruflin awesome! Will update with the results of my test. Thanks for the quick turnaround!
@ruflin Ran a test overnight and didn't see the issue anymore. Thanks again for jumping on the problem!
@viveklak Great to hear.
Relevant to https://discuss.elastic.co/t/filebeat-5-0-0-alpha5-multiple-harvesters-for-same-file-sending-the-same-messages-over-and-over