-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sidecar: fix sidecar crashing with lstat errors during a reload from a configmap update #2952
Conversation
Linter didn't like the Canadian spelling of behaviour. Pushing a fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Thanks a lot for the contribution.
Just added some suggestions. And let's see what also think.
pkg/reloader/reloader.go
Outdated
level.Error(r.logger).Log("msg", "path error", "err", err) | ||
default: | ||
// Critical error. | ||
// TODO(bwplotka): There is no need to get process down in this case and decrease availability, handle the error in different way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should still address this as well. There's no need to get the process down. Maybe decision can be made by the caller side.
pkg/reloader/reloader.go
Outdated
return err | ||
switch err := errors.Cause(err).(type) { | ||
case *os.PathError: | ||
r.watchErrors.Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make these errors more visible, we can add labels to watchErrors
and increment counter by the type.
Will push an update tomorrow. Need to think about how to handle the errors, there's a lot of different filesystem related ones. |
I was thinking a bit more about this and reloader errors feel pretty important since this component is responsible for rendering the final config. With my original approach I was only handling one type of filesystem error but I bet there's others. I updated this PR to add a short retry mechanism when calculating the rules hash and it hits an error. If the error still happens after the retries it's worth returning it to the caller instead of potentially swallowing it and reporting a metric that someone might not be checking. I'm not fully satisfied since the retry is not yet configurable(couldn't fine a nice way to pass that config) but feels better than the old approach. |
…a configmap update During a reload it's possible that files are added/removed while we walk the directory and we try to lstat a file that doesn't exist. This is especially common if running in k8s with the configuration mounted as a configmap, and the configmap contents change. With this change we will retry for a short while if we hit any filesystem errors during the reload while trying to calculate the hash. Signed-off-by: Ergin Babani <[email protected]>
@ebabani Could you have another look at this after the recent changes? We have an upcoming release, it'd be great to have this. We plan to cut it in today or tomorrow. |
@kakkoyun I verified that with the recent retry updates in #2996 we avoid the original lstat problem. Since it's a temporary issue when symlinks are changing retrying on config update errors is good enough. I think this PR can be closed since #2996 does essentially the same thing (with a nicer config 🙂 ) |
@ebabani Great to hear that the issue is fixed for you. Can we make use of the test that we have in this PR to prevent any regressions? If so we can merge the PR only with the tests. |
That would be covered by the test for general FS errors at https://github.com/thanos-io/thanos/blob/master/pkg/reloader/reloader_test.go#L145-L151 |
Closing this since the issue fixed by #2996 |
During a reload it's possible that files are added/removed while
we walk the directory and we try to lstat a file that doesn't exist.
This is especially common if running in k8s with the configuration mounted
as a configmap, and the configmap contents change.
With this change we log when this happens but don't return the error so the
reloader process doesn't crash.
Changes
Add an error check during a reload for when a watch triggers on a file that gets removed. This fixes the issue reported in #2496
Verification
Added a test to reproduce the initial error and to verify the fix