-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add file watch to support config reload on file change #4454
Conversation
Can we measure how much of CPU polling every second uses exactly? |
if os.IsNotExist(err) && lastfi != nil { | ||
return errNoOp | ||
} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to trigger reloading. Why do we reload if we can't stat the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed this one. thanks for catching this. as a result, i changed the flow up slightly to ensure that if someone writes a faulty config, we preserve the last sane state as long as we dont shut down the process. Such a bug would bring down the collector across the entire kube cluster if there was a faulty config map update.
// Perform an initial check. | ||
err := check() | ||
if err != nil && err != errNoOp { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we give up watching if the initial check fails? I think we can keep checking and reload when a change is detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have removed this.
// If check returns a valid event, exit the loop. A new watch will be placed on the next Retrieve() | ||
err := check() | ||
if err == nil || err != errNoOp { | ||
onChange(&ChangeEvent{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will reload the config immediately when a change is detected. That's an undesirable behavior since the file may be in the middle of being written and we may read partially written file. It is better to wait for some small amount of time (e.g. 1 second) after the last change to the file and only after that trigger reloading to increase the chance that the entire content of the file is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by adding a sleep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where this is addressed, please point me to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies. i somehow removed it during cleanup. added it back.
Raised #4460 to fix the blocking channel |
@tigrannajaryan the benchmark that I have done was using the following code:
and the result was:
we use this logic in some of our processing intensive code flows internally and we haven't run into issues so far. |
i have marked this PR as ready to review. it will require #4460 to be reviewed and rebased with before this one can go in if approved. |
// If check returns a valid event, exit the loop. A new watch will be placed on the next Retrieve() | ||
err := check() | ||
if err == nil || err != errNoOp { | ||
time.Sleep(time.Second * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this will result in reading the file 2 seconds after the first modification. A slightly better approach is to wait 2 seconds after the last modification. The difference is small but may be visible if we have small writers. It is probably fine for now.
type fileMapProvider struct { | ||
fileName string | ||
watching bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this? It does't seem to be set anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed this one. this is to ensure that we create a watch only once as Retrieve
is called each time the onChange
is invoked during a file change.
close(cm.watcher) | ||
return cm.ret.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have modified the code flow in a way that the config watcher is only created once in the lifecycle of the collector as compared to how it was originally implemented where a config watcher is created per change in the config file. once that change was made, it didnt make sense to close the retrieved and pass the error down. i moved that logic into a get()
method that follows the Retrieve()
-> watch
-> Close()
lifecycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have modified the code flow in a way that the config watcher is only created once in the lifecycle of the collector
The new rearranged code is harder to follow and understand. Please refactor it to clearly show that the code follows lifecycle described in the Provider
comments:
// The typical usage is the following:
//
// r := mapProvider.Retrieve()
// r.Get()
// // wait for onChange() to be called.
// r.Close()
// r = mapProvider.Retrieve()
// r.Get()
// // wait for onChange() to be called.
// r.Close()
// // repeat Retrieve/Get/wait/Close cycle until it is time to shut down the Collector process.
// // ...
// mapProvider.Shutdown()
It was more visible before this change, admittedly it was not ideal but was better than what we have now. Now it is even harder to see that we are actually following the required lifecycle. All the current loop in runAndWaitForShutdownEvent shows is a watch, followed by get()
.
Please rebase, and mark as resolved comments that are resolved. |
826531e
to
3a36712
Compare
Codecov Report
@@ Coverage Diff @@
## main #4454 +/- ##
==========================================
- Coverage 90.77% 90.70% -0.08%
==========================================
Files 179 179
Lines 10412 10468 +56
==========================================
+ Hits 9452 9495 +43
- Misses 743 754 +11
- Partials 217 219 +2
Continue to review full report at Codecov.
|
Incorporate review comments Add sleep to the onChange
e77a604
to
f907a2d
Compare
f907a2d
to
2132485
Compare
watchFile(ctx, fmp.fileName, onChange) | ||
fmp.watching = true | ||
} | ||
|
||
return &simpleRetrieved{confMap: cp}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will work as expected by the Provider
interface. The Retrieved
that is returned is expected to implement a Close
function that stops watching and guarantees that onChange
will not be called after that. See
// Close signals that the configuration for which it was used to retrieve values is |
close(cm.watcher) | ||
return cm.ret.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have modified the code flow in a way that the config watcher is only created once in the lifecycle of the collector
The new rearranged code is harder to follow and understand. Please refactor it to clearly show that the code follows lifecycle described in the Provider
comments:
// The typical usage is the following:
//
// r := mapProvider.Retrieve()
// r.Get()
// // wait for onChange() to be called.
// r.Close()
// r = mapProvider.Retrieve()
// r.Get()
// // wait for onChange() to be called.
// r.Close()
// // repeat Retrieve/Get/wait/Close cycle until it is time to shut down the Collector process.
// // ...
// mapProvider.Shutdown()
It was more visible before this change, admittedly it was not ideal but was better than what we have now. Now it is even harder to see that we are actually following the required lifecycle. All the current loop in runAndWaitForShutdownEvent shows is a watch, followed by get()
.
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Can you clarify why this doesn't work? Tools like jimmidyson/configmap-reload attempt to detect such changes, as does the Thanos reloader used with Prometheus. |
This capability could also help with #1591. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description:
Fixes: #4397
This PR allows the main config.yml to be reloaded each time the config file changes. The entire pipeline gets reloaded.
This PR doesn't use an FS notify/inotify style watcher as we have seen that Kubernetes doesnt support such watches when a config is mounted as a config file. os.Stat given that it is cheap can be run every second to trigger a reload.