Add file watch to support config reload on file change #4454

Closed
wants to merge 3 commits into from

vjsamuel
Contributor

Description:
Fixes: #4397

This PR allows the main config.yml to be reloaded each time the config file changes. The entire pipeline gets reloaded.

This PR doesn't use an FS notify/inotify style watcher, as we have seen that Kubernetes doesn't support such watches when a config is mounted as a config file. Since os.Stat is cheap, it can be run every second to trigger a reload.

@tigrannajaryan
Member

This PR doesn't use an FS notify/inotify style watcher, as we have seen that Kubernetes doesn't support such watches when a config is mounted as a config file. Since os.Stat is cheap, it can be run every second to trigger a reload.

Can we measure exactly how much CPU polling every second uses?

if os.IsNotExist(err) && lastfi != nil {
return errNoOp
}
return err
Member

This appears to trigger reloading. Why do we reload if we can't stat the file?

Contributor Author

Fixed this one, thanks for catching it. As a result, I changed the flow slightly to ensure that if someone writes a faulty config, we preserve the last sane state as long as we don't shut down the process. Such a bug would bring down the collector across the entire Kubernetes cluster if there were a faulty ConfigMap update.
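
The "preserve the last sane state" behavior described in this comment could look roughly like the sketch below. It is an illustration only, not the code from this PR: `Config`, `parse`, and `reloader` are hypothetical names, and the validation is a trivial stand-in for real config loading.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Config is a hypothetical, heavily simplified stand-in for the collector config.
type Config struct {
	Receivers []string
}

// parse is a hypothetical loader. It rejects empty input so that the example
// has something to fail on.
func parse(raw string) (*Config, error) {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return nil, errors.New("config has no receivers")
	}
	return &Config{Receivers: fields}, nil
}

// reloader holds the last configuration that parsed successfully.
type reloader struct {
	current *Config
}

// reload swaps in the new config only if it parses. On failure it keeps the
// previous config and reports the error to the caller.
func (r *reloader) reload(raw string) error {
	cfg, err := parse(raw)
	if err != nil {
		return fmt.Errorf("keeping previous config: %w", err)
	}
	r.current = cfg
	return nil
}

func main() {
	r := &reloader{}
	fmt.Println(r.reload("otlp jaeger")) // first load succeeds
	fmt.Println(r.reload(""))            // faulty update is rejected
	fmt.Println(r.current.Receivers)     // still the first config
}
```

The point of the design is that a failed reload is reported but never replaces a working pipeline; only a successful parse updates `current`.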

// Perform an initial check.
err := check()
if err != nil && err != errNoOp {
return err
Member

Why do we give up watching if the initial check fails? I think we can keep checking and reload when a change is detected.

Contributor Author

i have removed this.

// If check returns a valid event, exit the loop. A new watch will be placed on the next Retrieve()
err := check()
if err == nil || err != errNoOp {
onChange(&ChangeEvent{
Member

This will reload the config immediately when a change is detected. That is undesirable, since the file may be in the middle of being written and we may read a partially written file. It is better to wait for a small amount of time (e.g. 1 second) after the last change to the file, and only then trigger reloading, to increase the chance that the entire content of the file has been written.

Contributor Author

ack.

Contributor Author

addressed by adding a sleep.

Member

I don't see where this is addressed, please point me to the code.

Contributor Author

apologies. i somehow removed it during cleanup. added it back.

@vjsamuel
Contributor Author

Raised #4460 to fix the blocking channel

@vjsamuel
Contributor Author

@tigrannajaryan the benchmark I ran used the following code:

func BenchmarkOsStat(b *testing.B) {
	file, err := ioutil.TempFile("", "file_watcher_test")
	require.NoError(b, err)

	defer os.Remove(file.Name())
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		os.Stat(file.Name())
	}
}

and the result was:

goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/collector/config/configmapprovider
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
BenchmarkOsStat
BenchmarkOsStat-12    	  197296	      6127 ns/op	     288 B/op	       2 allocs/op
PASS

we use this logic in some of our processing intensive code flows internally and we haven't run into issues so far.

@vjsamuel vjsamuel marked this pull request as ready for review November 19, 2021 06:44
@vjsamuel vjsamuel requested review from a team and owais November 19, 2021 06:44
@vjsamuel
Contributor Author

I have marked this PR as ready for review. #4460 will need to be reviewed first, and this PR rebased on it, before this one can go in if approved.

// If check returns a valid event, exit the loop. A new watch will be placed on the next Retrieve()
err := check()
if err == nil || err != errNoOp {
time.Sleep(time.Second * 2)
Member

Nit: this will result in reading the file 2 seconds after the first modification. A slightly better approach is to wait 2 seconds after the last modification. The difference is small but may be visible if we have small writers. It is probably fine for now.

type fileMapProvider struct {
fileName string
watching bool
Member

What's the purpose of this? It doesn't seem to be set anywhere.

Contributor Author

fixed this one. this is to ensure that we create a watch only once as Retrieve is called each time the onChange is invoked during a file change.

close(cm.watcher)
return cm.ret.Close(ctx)
Member

Why is this deleted?

Contributor Author

I have modified the code flow so that the config watcher is created only once in the lifecycle of the collector, compared to the original implementation, where a config watcher is created per change in the config file. Once that change was made, it didn't make sense to close the Retrieved and pass the error down. I moved that logic into a get() method that follows the Retrieve() -> watch -> Close() lifecycle.

Member

i have modified the code flow in a way that the config watcher is only created once in the lifecycle of the collector

The newly rearranged code is harder to follow and understand. Please refactor it to clearly show that the code follows the lifecycle described in the Provider comments:

// The typical usage is the following:
//
//		r := mapProvider.Retrieve()
//		r.Get()
//		// wait for onChange() to be called.
//		r.Close()
//		r = mapProvider.Retrieve()
//		r.Get()
//		// wait for onChange() to be called.
//		r.Close()
//		// repeat Retrieve/Get/wait/Close cycle until it is time to shut down the Collector process.
//		// ...
//		mapProvider.Shutdown()

It was more visible before this change. Admittedly it was not ideal, but it was better than what we have now. Now it is even harder to see that we are actually following the required lifecycle. All the current loop in runAndWaitForShutdownEvent shows is a watch, followed by get().

@bogdandrutu
Member

Please rebase, and mark as resolved comments that are resolved.

@vjsamuel vjsamuel force-pushed the add_config_reload branch 2 times, most recently from 826531e to 3a36712 Compare November 24, 2021 07:45
@codecov

codecov bot commented Nov 24, 2021

Codecov Report

Merging #4454 (2132485) into main (adca4fb) will decrease coverage by 0.07%.
The diff coverage is 77.52%.

@@            Coverage Diff             @@
##             main    #4454      +/-   ##
==========================================
- Coverage   90.77%   90.70%   -0.08%     
==========================================
  Files         179      179              
  Lines       10412    10468      +56     
==========================================
+ Hits         9452     9495      +43     
- Misses        743      754      +11     
- Partials      217      219       +2     

| Impacted Files | Coverage Δ |
| --- | --- |
| service/collector.go | 73.91% <52.38%> (+0.02%) ⬆️ |
| service/config_watcher.go | 80.00% <79.16%> (-9.66%) ⬇️ |
| config/configmapprovider/file.go | 91.07% <88.37%> (-8.93%) ⬇️ |
| config/configmapprovider/properties.go | 89.65% <100.00%> (ø) |
| config/configmapprovider/simple.go | 50.00% <0.00%> (-50.00%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vjsamuel vjsamuel force-pushed the add_config_reload branch 6 times, most recently from e77a604 to f907a2d Compare November 29, 2021 08:13
watchFile(ctx, fmp.fileName, onChange)
fmp.watching = true
}

return &simpleRetrieved{confMap: cp}, nil
Member

@tigrannajaryan tigrannajaryan Dec 1, 2021

I don't think this will work as expected by the Provider interface. The Retrieved that is returned is expected to implement a Close function that stops watching and guarantees that onChange will not be called after that. See

// Close signals that the configuration for which it was used to retrieve values is

@github-actions
Contributor

github-actions bot commented Dec 9, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions
Contributor

github-actions bot commented Jan 2, 2022

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 2, 2022
@bogdandrutu bogdandrutu removed the Stale label Jan 3, 2022
@seh

seh commented Jan 14, 2022

This PR doesn't use an FS notify/inotify style watcher, as we have seen that Kubernetes doesn't support such watches when a config is mounted as a config file.

Can you clarify why this doesn't work? Tools like jimmidyson/configmap-reload attempt to detect such changes, as does the Thanos reloader used with Prometheus.

@seh

seh commented Jan 14, 2022

This capability could also help with #1591.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 29, 2022
@github-actions
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 13, 2022

@HankVeal12 HankVeal12 left a comment

Successfully merging this pull request may close these issues.

configmapprovider.File does not watch for file changes
5 participants