-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[new feature] promtail: Add config reload endoint / signal to promtail #7247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this addition. I think this is a good idea for promtail, I just have a couple thoughts and questions.
clients/pkg/promtail/promtail.go
Outdated
return promtail, nil | ||
} | ||
|
||
func (p *Promtail) reloadConfig(cfg config.Config) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not a pointer to config.Config? It seems like we're mutating it within this function (ie. cfg.PositionsConfig.ReadOnly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for such fast feedback,done.
promtailServer := p.server.(*server.PromtailServer) | ||
for { | ||
select { | ||
case <-hup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, not super familiar with these live reload patterns, can you explain the SIGHUP flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refer to the reload documentation of prometheus, we should also provide this reload way.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/
A configuration reload is triggered by sending a SIGHUP.
@@ -693,7 +693,7 @@ func Test_DryRun(t *testing.T) { | |||
PositionsFile: f.Name(), | |||
SyncPeriod: time.Second, | |||
}, | |||
}, clientMetrics, false, nil) | |||
}, nil, clientMetrics, false, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a test that this new configReload function is called? Not sure if we need to go so far as the whole config gets reloaded, but maybe at least the function is called, and test the flow where watch config is disabled when the function is not provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
hi, your review suggestion is great, through this test I found a bug that "PromtailServer.promtailCfg" was not updated correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the nice feature @liguozhong!
This is a very delicate feature which has to be handled and communicated very carefully. I've left a bunch of small nits relating to wording, and identified a couple areas that I'm a bit concerned about - most notably the lack of sad-day tests and the behaviour of invalid configs exiting the process on reload.
clients/pkg/promtail/promtail.go
Outdated
level.Warn(p.logger).Log("msg", "disable watchConfig") | ||
return | ||
} | ||
level.Warn(p.logger).Log("msg", "enable watchConfig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These log messages should be more clear, please.
clients/pkg/promtail/promtail.go
Outdated
cfg := p.newConfig() | ||
if err := p.reloadConfig(cfg); err != nil { | ||
level.Error(p.logger).Log("msg", "Error reloading config", "err", err) | ||
} | ||
case rc := <-promtailServer.Reload(): | ||
cfg := p.newConfig() | ||
if err := p.reloadConfig(cfg); err != nil { | ||
level.Error(p.logger).Log("msg", "Error reloading config", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we centralised this logic rather than repeating it
@@ -51,6 +54,8 @@ type Config struct { | |||
ExternalURL string `yaml:"external_url"` | |||
HealthCheckTarget *bool `yaml:"health_check_target"` | |||
Disable bool `yaml:"disable"` | |||
Reload bool `yaml:"reload"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider naming this more clearly.
I would suggest enable_runtime_reload
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
clients/pkg/promtail/promtail.go
Outdated
return nil, err | ||
} | ||
server, err := server.New(cfg.ServerConfig, promtail.logger, promtail.targetManagers, cfg.String()) | ||
if err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wrap these errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -51,6 +54,8 @@ type Config struct { | |||
ExternalURL string `yaml:"external_url"` | |||
HealthCheckTarget *bool `yaml:"health_check_target"` | |||
Disable bool `yaml:"disable"` | |||
Reload bool `yaml:"reload"` | |||
NewByReload bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant, invalid variables left over from my development process
@@ -60,6 +65,7 @@ func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) { | |||
cfg.Config.RegisterFlags(f) | |||
|
|||
f.BoolVar(&cfg.Disable, prefix+"server.disable", false, "Disable the http and grpc server.") | |||
f.BoolVar(&cfg.Reload, prefix+"server.reload", false, "Enable reload via HTTP request.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can also be reloaded via SIGHUP
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, the SIGHUP way in the reload of prometheus cannot be disable.
I suggest we should keep the same behavior as prometheus.
clients/cmd/promtail/main.go
Outdated
var config Config | ||
if err := cfg.DefaultUnmarshal(&config, args, flag.NewFlagSet(os.Args[0], flag.ExitOnError)); err != nil { | ||
fmt.Println("Unable to parse config:", err) | ||
os.Exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I certainly don't think this behaviour is desired; if a runtime config cannot be reloaded, it should not kill the process.
I think we need to distinguish between the initial load and subsequent (runtime) reloads.
This will also better match how Loki works with its overrides.
Co-authored-by: Danny Kopping <[email protected]>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
All the review tips, I have made targeted changes |
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
- querier/queryrange -0.1%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
@dannykopping hi, this pr is ready for review . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good @liguozhong - a few minor points to address then I think this is ready to gp
clients/pkg/promtail/promtail.go
Outdated
@@ -8,6 +8,7 @@ import ( | |||
|
|||
"github.com/go-kit/log" | |||
"github.com/go-kit/log/level" | |||
"github.com/pkg/errors" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's please use the stdlib errors where possible
clients/pkg/promtail/promtail.go
Outdated
@@ -65,20 +74,24 @@ func New(cfg config.Config, newConfig func() *config.Config, metrics *client.Met | |||
metrics: metrics, | |||
dryRun: dryRun, | |||
} | |||
err := promtail.reg.Register(reloadTotal) | |||
if err != nil { | |||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wrap this error
clients/pkg/promtail/promtail.go
Outdated
@@ -87,9 +100,14 @@ func New(cfg config.Config, newConfig func() *config.Config, metrics *client.Met | |||
} | |||
|
|||
func (p *Promtail) reloadConfig(cfg *config.Config) error { | |||
level.Info(p.logger).Log("msg", "Loading configuration file") | |||
level.Info(p.logger).Log("msg", "Reloading configuration file") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think let's move this into the condition on L107 so that it only logs when a change is made.
Alternatively, you could change this to a debug
level and add an info
level log line in the condition on L107
} | ||
promtailServer, ok := p.server.(*server.PromtailServer) | ||
if !ok { | ||
level.Warn(p.logger).Log("msg", "disable watchConfig", "reason", "promtailServer cast fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log message won't mean much to the user. Can we return the actual parse error here?
clients/pkg/promtail/promtail.go
Outdated
} | ||
err = p.reloadConfig(cfg) | ||
if err != nil { | ||
reloadTotal.With(prometheus.Labels{"code": "500"}).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these 200
and 500
codes meant to mimick HTTP status codes?
If so, I think that's confusing. Let's be clear here and have two separate metrics - one for successful reloads and one for failures.
Co-authored-by: Danny Kopping <[email protected]>
Auto-merging clients/cmd/promtail/main.go I have to merge master branch |
# Conflicts: # clients/cmd/promtail/main.go
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
- querier/queryrange -0.1%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
Thank you for investing so much time in helping to review this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last couple of small nits around errors, and then we can get this merged 👍
clients/pkg/promtail/promtail.go
Outdated
if err != nil { | ||
return nil, err | ||
return nil, errors.Wrap(err, "error register prometheus collector reloadSuccessTotal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace these with fmt.Errorf
clients/pkg/promtail/promtail.go
Outdated
cfg, err := p.newConfig() | ||
if err != nil { | ||
reloadFailTotal.Inc() | ||
return errors.Wrap(err, "Error new Config") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use fmt.Errorf
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0.1%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
done ,and cicd has passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot for this!
What this PR does / why we need it:
Add config reload endoint / signal to promtail.
reload is a very dangerous feature, it is easy to make promtail panic, we can refer to the failure history of another log agent
● vectordotdev/vector#10485
● vectordotdev/vector#10412
● vectordotdev/vector#13228
But "/reload" can make promtail more flexible, make loki easier to use with k8s, and provide a better foundation for log tail CRD (like prometheus ServiceMonitor).
config
detail config
Which issue(s) this PR fixes:
Fixes #16
Special notes for your reviewer:
old config
curl "/reload"
new config, "job: varlogs-reload-promtail-test" has change
Checklist
CONTRIBUTING.md
guideCHANGELOG.md
updateddocs/sources/upgrading/_index.md