-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Antrea Prometheus integration #321
Conversation
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization. |
Still missing the controller listener init and obviously more metrics. |
cmd/antrea-controller/controller.go
Outdated
|
||
klog.Infof("Initializing prometheus host %s port %s", prometheusHost, prometheusPort) | ||
gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{ | ||
Name: "antrea_controller_host", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some of these can be constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK if that's necessary. Using constant strings here is a common practice (e.g k8s itself, calico etc).
These values won't be used anywhere except in metric definitions so consts won't be reused anywhere - hence it won't enhance readability.
rebase and remove wip? |
a2fc0fd
to
e15a3e6
Compare
/test-all |
1 similar comment
/test-all |
e15a3e6
to
e401e3b
Compare
/test-all |
1 similar comment
/test-all |
/test-e2e |
1 similar comment
/test-e2e |
e401e3b
to
a51f9d7
Compare
/test-all |
Name: "antrea_agent_local_pod_count", | ||
Help: "Number of pods on local node.", | ||
}, | ||
func() float64 { return float64(ifaceStore.GetContainerInterfaceNum()) }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Kobi,
I believe pod count is captured through number of CNIs on the node.
Will this pod count be different from pod count we get from pod metrics/node metrics from Kubernetes native Metrics API under path /apis/metrics.k8s.io/?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could pods in hostnetwork which won't be managed by CNI, so might be slightly different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the metric description doesn't reflect that, I'll change it.
cmd/antrea-controller/controller.go
Outdated
<-stopCh | ||
klog.Info("Stopping Antrea controller") | ||
return nil | ||
} | ||
|
||
// Initialize Prometheus listener and metrics collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no listener anymore right?
/test-all |
a51f9d7
to
f62e560
Compare
@srikartati the metric exposed by /apis/metrics.k8s.io/ will show all the pods on the node while this one will show the ones connected via Antrea. |
f62e560
to
742e935
Compare
/test-all |
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I'm just uncertain if it's safe to expose metrics without authentication as it could have some process and pods information, I see at least Kubernetes secures all of its /metrics api. We may see how it's like for other softwares.
@jianjuns @McCodeman Do you think whether we should leave the API insecure or force secure it.
pkg/ovs/ovsconfig/ovs_client.go
Outdated
@@ -92,7 +92,7 @@ func NewOVSDBConnectionUDS(address string) (*ovsdb.OVSDB, Error) { | |||
return db, nil | |||
} | |||
|
|||
// NewOVSBridge creates and returns a new OVSBridge struct. | |||
// NewOVSBridge creates and returns a new ovsBridge struct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess this is a side effect of IDE when you renamed OVSBridge
of ovsStatManager
to ovsBridge
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Fixed.
pkg/controller/metrics/prometheus.go
Outdated
klog.Errorf("Failed to retrieve controller K8S node name: %v", err) | ||
} | ||
|
||
klog.Info("Initializing prometheus") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use "Initializing prometheus metrics" as below to keep log consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@tnqn I'm not very confident about exposure without authentication. |
@jianjuns indeed I've secured the agent but I've also created a pinhole for the metrics URL e.g here: |
0ab4ff6
to
6abee53
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I assume you will have a doc to explain how to enable prometheus metrics for both configuration and RBAC.
@tnqn I can add the doc in a separate PR or within this one |
@jianjuns @antoninbas The PR LGTM, could you take a look so we catch 0.6.0? There are separate PRs for docs and CI. |
/test-all |
@@ -46,3 +46,6 @@ | |||
# Note that if it's set to another value, the `containerPort` of the `api` port of the | |||
# `antrea-agent` container must be set to the same value. | |||
#apiPort: 10350 | |||
|
|||
# Enable metrics exposure via Prometheus. Initializes Prometheus metrics listener |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing period at the end (but not for the controller config).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still see the same issue in other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, make manifest could help here :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised you didn't get a CI error actually. If the manifests are not up-to-date (i.e. if the committer forgot to run make manifest
), one of the CI jobs should fail.
pkg/agent/metrics/prometheus.go
Outdated
ConstLabels: prometheus.Labels{"k8s_nodename": nodeName, "k8s_podname": env.GetPodName()}, | ||
}) | ||
gaugeHost.Set(1) | ||
prometheus.MustRegister(gaugeHost) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a stupid question, but why do we use MustRegister
here and Register
for "antrea_agent_local_pod_count"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to know this as well :) what's the difference ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to the difference is that I used different sample code for both :)
MustRegister() will panic on failure while Register() returns an error.
I'm guessing that using Register() is the correct approach here unless you think otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
pkg/agent/metrics/prometheus.go
Outdated
ofClient openflow.Client) { | ||
|
||
klog.Info("Initializing prometheus metrics") | ||
if err := prometheus.Register(prometheus.NewGaugeFunc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember there was a conversation about using the k8s wrapper objects instead of the Prometheus ones. Has this been abandoned or postponed to a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If postponed, please open an issue - would also be good to explain in the issue why we think the wrappers are superior for our use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The k8s wrapper was proposed by @srikartati and is described in the this doc.
It offers stability tagging for metrics, e.g mark a metric as stable, alpha, deprecated etc, and offers separate endpoint (e.g /metrics/v1) for the stable metrics.
It would be nice to have as:
A) At some point we could need this functionality.
B) It's a part of k8s ecosystem
However at this point (A) isn't very relevant and Prometheus API definitions as within this patch are still widely common.
I would rather have Prometheus support and handle this in a latter time, when this is a priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Would still like to see a low-priority Github issue with a pointer to the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See Issue #662
pkg/agent/metrics/prometheus.go
Outdated
ConstLabels: prometheus.Labels{"k8s_nodename": nodeName, "k8s_podname": env.GetPodName()}, | ||
}) | ||
gaugeHost.Set(1) | ||
prometheus.MustRegister(gaugeHost) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to know this as well :) what's the difference ?
klog.Errorf("Failed to retrieve agent K8S node name: %v", err) | ||
} | ||
|
||
gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for a newbie reading prometheus related code.. can you add a code comment explaining what NewGauge really does? if it is pretty common and basic then ignore my comment.. i will check it out in prometheus docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus offers 4 metrics types: gauge, counter, histogram and summary.
Each can carry additional labels.
e.g, in our case, antrea_agent_ovs_flow_table will be labeled with the source agent,
the table id etc.
More info here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks sounds good
pkg/agent/metrics/prometheus.go
Outdated
|
||
gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{ | ||
Name: "antrea_agent_runtime_info", | ||
Help: "Antrea agent runtime info , defined as a labels. The value of the gauge is always set to 1.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help: "Antrea agent runtime info , defined as a labels. The value of the gauge is always set to 1.", | |
Help: "Antrea agent runtime info , defined as labels. The value of the gauge is always set to 1.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Antrea agent uses apiserver to interact with antctl. However, unlike the controller, the API endpoint is exposed only locally which limits antctl to local execution. Additionally, we would like to reuse the listener for Prometheus metrics, which would require external access to the API endpoint's metrics path. Having authentication in a similar way to the controller's implementation could help here as it will allow external exposure of the API endpoint.
6abee53
to
181d3ed
Compare
pkg/agent/metrics/prometheus.go
Outdated
|
||
ovsStats := newOVSStatManager(ovsBridge, ofClient) | ||
if err := prometheus.Register(ovsStats); err != nil { | ||
klog.Error("Failed to register antrea_agent_local_pod_count with Prometheus") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ksamoray do you mean antrea_agent_ovs_flow_table
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Thanks
Integrate with Prometheus monitoring solution. Integration of the Prometheus client into Antrea controller and agent allows the exposure of various metrics to Prometheus server. In addition to Antrea's own set of metrics, Prometheus client will also expose metrics which are defined by various components which are part of the Antrea ecosystem, e.g golang, Prometheus itself etc.
181d3ed
to
9845b1b
Compare
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/test-networkpolicy |
Integrate with Prometheus monitoring solution. Integration of the Prometheus client into Antrea controller and agent allows the exposure of various metrics to Prometheus server. In addition to Antrea's own set of metrics, Prometheus client will also expose metrics which are defined by various components which are part of the Antrea ecosystem, e.g golang, Prometheus itself etc. See antrea-io#236
Integrate with Prometheus monitoring solution. Integration of the Prometheus client into Antrea controller and agent allows the exposure of various metrics to Prometheus server. In addition to Antrea's own set of metrics, Prometheus client will also expose metrics which are defined by various components which are part of the Antrea ecosystem, e.g golang, Prometheus itself etc. See antrea-io#236
Integrate with Prometheus monitoring solution. Integration of the Prometheus client into Antrea controller and agent allows the exposure of various metrics to Prometheus server. In addition to Antrea's own set of metrics, Prometheus client will also expose metrics which are defined by various components which are part of the Antrea ecosystem, e.g golang, Prometheus itself etc. See #236
Integrate with Prometheus monitoring solution. Integration of the Prometheus client into Antrea controller and agent allows the exposure of various metrics to Prometheus server. In addition to Antrea's own set of metrics, Prometheus client will also expose metrics which are defined by various components which are part of the Antrea ecosystem, e.g golang, Prometheus itself etc. See antrea-io#236
Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.