Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sidecar: fix sidecar crashing with lstat errors during a reload from a configmap update #2952

Closed
wants to merge 1 commit into from

Conversation

ebabani
Copy link

@ebabani ebabani commented Jul 28, 2020

During a reload it's possible that files are added/removed while
we walk the directory and we try to lstat a file that doesn't exist.

This is especially common if running in k8s with the configuration mounted
as a configmap, and the configmap contents change.

With this change we log when this happens but don't return the error so the
reloader process doesn't crash.

  • I added CHANGELOG entry for this change.

Changes

Add an error check during a reload for when a watch triggers on a file that gets removed. This fixes the issue reported in #2496

Verification

Added a test to reproduce the initial error and to verify the fix

@ebabani ebabani changed the title sidecar: handle lstat errors during a reload without crashing sidecar: fix sidecar crashing with lstat errors during a reload from a configmap update Jul 28, 2020
@ebabani
Copy link
Author

ebabani commented Jul 29, 2020

Linter didn't like the Canadian spelling of behaviour. Pushing a fix.

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Thanks a lot for the contribution.
Just added some suggestions. And let's see what also think.

level.Error(r.logger).Log("msg", "path error", "err", err)
default:
// Critical error.
// TODO(bwplotka): There is no need to get process down in this case and decrease availability, handle the error in different way.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should still address this as well. There's no need to get the process down. Maybe decision can be made by the caller side.

return err
switch err := errors.Cause(err).(type) {
case *os.PathError:
r.watchErrors.Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make these errors more visible, we can add labels to watchErrors and increment counter by the type.

@ebabani
Copy link
Author

ebabani commented Jul 30, 2020

Will push an update tomorrow. Need to think about how to handle the errors, there's a lot of different filesystem related ones.

@ebabani
Copy link
Author

ebabani commented Jul 31, 2020

I was thinking a bit more about this and reloader errors feel pretty important since this component is responsible for rendering the final config.

With my original approach I was only handling one type of filesystem error but I bet there's others. I updated this PR to add a short retry mechanism when calculating the rules hash and it hits an error. If the error still happens after the retries it's worth returning it to the caller instead of potentially swallowing it and reporting a metric that someone might not be checking.

I'm not fully satisfied since the retry is not yet configurable(couldn't fine a nice way to pass that config) but feels better than the old approach.

…a configmap update

During a reload it's possible that files are added/removed while
we walk the directory and we try to lstat a file that doesn't exist.

This is especially common if running in k8s with the configuration mounted
as a configmap, and the configmap contents change.

With this change we will retry for a short while if we hit any
filesystem errors during the reload while trying to calculate the hash.

Signed-off-by: Ergin Babani <[email protected]>
@kakkoyun
Copy link
Member

@ebabani Could you have another look at this after the recent changes? We have an upcoming release, it'd be great to have this. We plan to cut it in today or tomorrow.

@ebabani
Copy link
Author

ebabani commented Aug 12, 2020

@kakkoyun I verified that with the recent retry updates in #2996 we avoid the original lstat problem. Since it's a temporary issue when symlinks are changing retrying on config update errors is good enough.

I think this PR can be closed since #2996 does essentially the same thing (with a nicer config 🙂 )

@kakkoyun
Copy link
Member

@ebabani Great to hear that the issue is fixed for you. Can we make use of the test that we have in this PR to prevent any regressions? If so we can merge the PR only with the tests.

@ebabani
Copy link
Author

ebabani commented Aug 13, 2020

That would be covered by the test for general FS errors at https://github.com/thanos-io/thanos/blob/master/pkg/reloader/reloader_test.go#L145-L151

@kakkoyun
Copy link
Member

Closing this since the issue fixed by #2996

@kakkoyun kakkoyun closed this Aug 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants