-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix metrics registering #1258
Fix metrics registering #1258
Conversation
@matevzmihalic I think it would be better to rebase this PR on branch |
d180b08
to
587e268
Compare
587e268
to
3597105
Compare
Rebased to v1.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! I also ran into this problem but couldn't identify the cause immediately.
Could we maybe add some testing for the new functionality?
f7accdb
to
f4bdfbd
Compare
Added tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @matevzmihalic, thanks a lot :)
LGTM!
middlewares/prometheus_test.go
Outdated
n.ServeHTTP(recorder, req2) | ||
|
||
body = recorder.Body.String() | ||
re := regexp.MustCompile(`(?m)traefik_requests_total.*?(\d)$`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test seems to be testing the implementation of the metrics handler. While I'm all in favor of doing that, I believe it should go into (the yet missing) metrics_test.go
.
More importantly though, I'd probably test it differently by means of comparing the counter label name and value of the underlying LabelPair
. Doing a regular expression to extract the number of total requests is somewhat brittle since we rely on an implementation detail of Prometheus that could change in a future version.
My proposal would be to remove this part of the test and leave it for another PR (which I'd be happy to do as well). I think the next test paragraph covers what we want to achieve with the PR at hand.
middlewares/prometheus_test.go
Outdated
} | ||
|
||
metrics, _ := prometheus.DefaultGatherer.Gather() | ||
if metrics[len(metrics)-1].Metric[0].GetCounter().GetValue() != 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't understand: Why do we expect a total request count of 2 in line 61 and 3 here? Can you explain? I suppose I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First we make 2 requests, then we make another request (line 57) which outputs metrics and after that the counter gets updated to 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the request happens on line 57, I'd have expected the check on line 61 to also compare against 3; however, it expects 2.
What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time when request 3 happens the counter still has value 2 and this value is outputted to body (which is used in comparison on line 61). Only after that the counter increases. (But this doesn't really matter because I will remove the regex check.)
middlewares/prometheus_test.go
Outdated
metrics, _ := prometheus.DefaultGatherer.Gather() | ||
if metrics[len(metrics)-1].Metric[0].GetCounter().GetValue() != 3 { | ||
t.Error("gathered metrics does not contain correct value for total requests") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extend the test to also cover the Histogram tracking the request duration? I think we can use the Histogram SampleCount
to assert on.
middlewares/prometheus_test.go
Outdated
t.Error("body does not contain correct value for total requests") | ||
} | ||
|
||
metrics, _ := prometheus.DefaultGatherer.Gather() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check the error and panic if it's non-nil? It's basically an invariant we want to ensure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename metrics
to metricsFamily
? This will make understanding the next line easier I think.
I've added labels tests and histogram test. |
0cb2a15
to
1629839
Compare
middlewares/prometheus_test.go
Outdated
|
||
metricsFamily, err := prometheus.DefaultGatherer.Gather() | ||
if err != nil { | ||
t.Error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change to
t.Fatalf("could not gather metrics family: %s", err)
Fatal because we shouldn't try to continue the test if we cannot even gather the metrics family.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, I really do not agree on this. Why would Traefik stop because of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear: this isn't about Traefik stopping; it's about the test bailing out because the remainder of it relies upon the metric family being available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops... Again, read too quickly sorry... I didn't see it was in a test...
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if reqsMetricFamily == nil { | ||
t.Errorf("gathered metrics doesn not contain request total entry '%s'", reqsName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: doesn
-> doesn't
.
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if latencyMetricFamily == nil { | ||
t.Errorf("gathered metrics doesn not contain request duration entry '%s'", latencyName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeated typo.
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if reqsMetricFamily == nil { | ||
t.Errorf("gathered metrics does not contain request total entry '%s'", reqsName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar: do not
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if latencyMetricFamily == nil { | ||
t.Errorf("gathered metrics does not contain request duration entry '%s'", latencyName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar: do not
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if reqsMetricFamily.Metric[0].Counter.GetValue() != 3 { | ||
t.Error("gathered metrics does not contain correct value for total requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do t.Errorf
here and include the actual counter value.
middlewares/prometheus_test.go
Outdated
} | ||
|
||
if latencyMetricFamily.Metric[0].Histogram.GetSampleCount() != 3 { | ||
t.Error("gathered metrics does not contain correct sample count for request duration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do t.Errorf
here and include the actual sample count value.
middlewares/prometheus_test.go
Outdated
labels map[string]string | ||
}{ | ||
{ | ||
reqsName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add the struct field names on each line, i.e., name: reqsName
here? This improves readability and pleases some linters.
middlewares/prometheus_test.go
Outdated
} | ||
} | ||
|
||
if reqsMetricFamily == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test in its current implementation has two shortcomings:
- If
reqsMetricFamily
here (orlatencyMetricFamily
later) isnil
, we still continue with the test execution. At best, this leads to follow-up test errors that we should not have to care about but will aggravate interpreting the test result. At worst, we run into nil pointer dereferences in lines 106, 117, and/or 121. - We cannot easily ignore/shortcut an individual
nil
case given the chosen test design.
There are a few options to fix this. Here's the one I prefer:
- Rename the
expectedLabelsForMetricFamily
struct intotests
and have it include a customassert
field of typefunc(*dto.MetricFamily)
. This function is supposed to implement the check in line 117 for thereqsMetricFamily
case and the one in line 121 forlatencyMetricFamily
. (The idea is somewhat similar to theparserFunc
in this test.) - Move the lookup of the metric families by metric name into the test loop (i.e., between line 105 and 106). This way, we can
t.Errorf
andcontinue
the loop if we cannot find a metric family. Also, we can move themetricFamily
variable from the test struct to the stack inside the loop. - Execute
assert
in the loop, passing it the metric family we just looked up.
I'd also suggest to move the lookup functionality into a small helper function, i.e., findMetricFamily(name string, families []*dto.MetricFamily) *dto.MetricFamily
.
middlewares/prometheus_test.go
Outdated
"service": "test", | ||
}, | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right after the definition of the test cases, I think we should do a check between the number of metric families we found and the number of test cases we have. This way, we can catch the situation where have added a new metric without updating the test.
Updated tests with your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superb job, @matevzmihalic -- thanks for bearing with me.
LGTM! 🎉
Syncing with master via Github UI and merging afterwards if CI goes green. |
ce67ee3
to
b5de37e
Compare
Rebased like a good Traefik citizen (and squashed those fixup commits). :-) |
Functions
prometheus.NewCounterFrom
andprometheus.NewHistogramFrom
are usingstdprometheus.MustRegister
which will panic if you try to register the same metric multiple times.Because of this you would get
Error in Go routine: duplicate metrics collector registration attempted
when traefik tries to update configuration and new services will not get added.With this fix traefik will use already registered metrics if that happens.