Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Antrea Prometheus integration #321

Merged
merged 2 commits into from
Apr 29, 2020

Conversation

ksamoray
Copy link
Contributor

@ksamoray ksamoray commented Jan 16, 2020

Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@ksamoray
Copy link
Contributor Author

Still missing the controller listener init and obviously more metrics.
Pushing this to make sure that general direction makes sense (likely not).

build/yamls/antrea-ipsec.yml Outdated Show resolved Hide resolved
cmd/antrea-controller/controller.go Outdated Show resolved Hide resolved
cmd/antrea-controller/controller.go Outdated Show resolved Hide resolved

klog.Infof("Initializing prometheus host %s port %s", prometheusHost, prometheusPort)
gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{
Name: "antrea_controller_host",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of these can be constants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK if that's necessary. Using constant strings here is a common practice (e.g k8s itself, calico etc).
These values won't be used anywhere except in metric definitions so consts won't be reused anywhere - hence it won't enhance readability.

pkg/agent/metrics/prometheus.go Outdated Show resolved Hide resolved
@abhiraut
Copy link
Contributor

rebase and remove wip?

build/yamls/antrea.yml Outdated Show resolved Hide resolved
@ksamoray
Copy link
Contributor Author

/test-all

1 similar comment
@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray
Copy link
Contributor Author

/test-all

1 similar comment
@edwardbadboy
Copy link
Contributor

/test-all

@ksamoray ksamoray changed the title [wip] Antrea Prometheus integration Antrea Prometheus integration Mar 23, 2020
@ksamoray
Copy link
Contributor Author

/test-e2e

1 similar comment
@ksamoray
Copy link
Contributor Author

/test-e2e

build/yamls/base/conf/antrea-agent.conf Outdated Show resolved Hide resolved
build/yamls/base/agent.yml Outdated Show resolved Hide resolved
cmd/antrea-controller/controller.go Outdated Show resolved Hide resolved
cmd/antrea-controller/controller.go Outdated Show resolved Hide resolved
@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray ksamoray linked an issue Mar 24, 2020 that may be closed by this pull request
@ksamoray ksamoray self-assigned this Mar 24, 2020
Name: "antrea_agent_local_pod_count",
Help: "Number of pods on local node.",
},
func() float64 { return float64(ifaceStore.GetContainerInterfaceNum()) },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kobi,
I believe pod count is captured through number of CNIs on the node.
Will this pod count be different from pod count we get from pod metrics/node metrics from Kubernetes native Metrics API under path /apis/metrics.k8s.io/?

Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could pods in hostnetwork which won't be managed by CNI, so might be slightly different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the metric description doesn't reflect that, I'll change it.

<-stopCh
klog.Info("Stopping Antrea controller")
return nil
}

// Initialize Prometheus listener and metrics collection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no listener anymore right?

@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray
Copy link
Contributor Author

@srikartati the metric exposed by /apis/metrics.k8s.io/ will show all the pods on the node while this one will show the ones connected via Antrea.

@ksamoray
Copy link
Contributor Author

/test-all

@ksamoray
Copy link
Contributor Author

/test-all

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'm just uncertain if it's safe to expose metrics without authentication as it could have some process and pods information, I see at least Kubernetes secures all of its /metrics api. We may see how it's like for other softwares.
@jianjuns @McCodeman Do you think whether we should leave the API insecure or force secure it.

@@ -92,7 +92,7 @@ func NewOVSDBConnectionUDS(address string) (*ovsdb.OVSDB, Error) {
return db, nil
}

// NewOVSBridge creates and returns a new OVSBridge struct.
// NewOVSBridge creates and returns a new ovsBridge struct.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this is a side effect of IDE when you renamed OVSBridge of ovsStatManager to ovsBridge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Fixed.

klog.Errorf("Failed to retrieve controller K8S node name: %v", err)
}

klog.Info("Initializing prometheus")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use "Initializing prometheus metrics" as below to keep log consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ksamoray
Copy link
Contributor Author

@tnqn I'm not very confident about exposure without authentication.
From one hand - it's same as having metrics on a dedicated HTTP port which is a common practice.
On the other hand: I've looked at k8s itself, it secures the metrics over port 6443. which makes me think of two different options:
a) Secure it, figure out the correct config for Prometheus to consume it (if such exists and it ain't tailored to k8s which I doubt).
b) Add a config option to secure /metrics. I don't like this but it's doable.

@jianjuns
Copy link
Contributor

@ksamoray @tnqn : I thought PR #622 enable auth for agent API already. Why you are saying exposing metrics without auth? Maybe you mean we do not know how to let Premetheus authenticate with K8s?

@ksamoray
Copy link
Contributor Author

@jianjuns indeed I've secured the agent but I've also created a pinhole for the metrics URL e.g here:
https://github.com/vmware-tanzu/antrea/blob/0ab4ff6ef2c560a73e01a3dee2f049a4aa9039e4/cmd/antrea-controller/controller.go#L128
I think that @tnqn is right and I should get rid of this and have Prometheus authenticate.
I think that I've got this - I'll document how to configure Prometheus within this PR.

tnqn
tnqn previously approved these changes Apr 26, 2020
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I assume you will have a doc to explain how to enable prometheus metrics for both configuration and RBAC.

@ksamoray
Copy link
Contributor Author

@tnqn I can add the doc in a separate PR or within this one

@tnqn
Copy link
Member

tnqn commented Apr 27, 2020

@jianjuns @antoninbas The PR LGTM, could you take a look so we catch 0.6.0? There are separate PRs for docs and CI.

@tnqn
Copy link
Member

tnqn commented Apr 27, 2020

/test-all

@@ -46,3 +46,6 @@
# Note that if it's set to another value, the `containerPort` of the `api` port of the
# `antrea-agent` container must be set to the same value.
#apiPort: 10350

# Enable metrics exposure via Prometheus. Initializes Prometheus metrics listener
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing period at the end (but not for the controller config).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still see the same issue in other files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, make manifest could help here :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised you didn't get a CI error actually. If the manifests are not up-to-date (i.e. if the committer forgot to run make manifest), one of the CI jobs should fail.

ConstLabels: prometheus.Labels{"k8s_nodename": nodeName, "k8s_podname": env.GetPodName()},
})
gaugeHost.Set(1)
prometheus.MustRegister(gaugeHost)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a stupid question, but why do we use MustRegister here and Register for "antrea_agent_local_pod_count"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to know this as well :) what's the difference ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to the difference is that I used different sample code for both :)
MustRegister() will panic on failure while Register() returns an error.
I'm guessing that using Register() is the correct approach here unless you think otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

ofClient openflow.Client) {

klog.Info("Initializing prometheus metrics")
if err := prometheus.Register(prometheus.NewGaugeFunc(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember there was a conversation about using the k8s wrapper objects instead of the Prometheus ones. Has this been abandoned or postponed to a future PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If postponed, please open an issue - would also be good to explain in the issue why we think the wrappers are superior for our use case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The k8s wrapper was proposed by @srikartati and is described in the this doc.
It offers stability tagging for metrics, e.g mark a metric as stable, alpha, deprecated etc, and offers separate endpoint (e.g /metrics/v1) for the stable metrics.
It would be nice to have as:
A) At some point we could need this functionality.
B) It's a part of k8s ecosystem

However at this point (A) isn't very relevant and Prometheus API definitions as within this patch are still widely common.
I would rather have Prometheus support and handle this in a latter time, when this is a priority.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Would still like to see a low-priority Github issue with a pointer to the doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Issue #662

pkg/agent/metrics/prometheus.go Show resolved Hide resolved
ConstLabels: prometheus.Labels{"k8s_nodename": nodeName, "k8s_podname": env.GetPodName()},
})
gaugeHost.Set(1)
prometheus.MustRegister(gaugeHost)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to know this as well :) what's the difference ?

klog.Errorf("Failed to retrieve agent K8S node name: %v", err)
}

gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a newbie reading prometheus related code.. can you add a code comment explaining what NewGauge really does? if it is pretty common and basic then ignore my comment.. i will check it out in prometheus docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prometheus offers 4 metrics types: gauge, counter, histogram and summary.
Each can carry additional labels.
e.g, in our case, antrea_agent_ovs_flow_table will be labeled with the source agent,
the table id etc.

More info here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks sounds good


gaugeHost := prometheus.NewGauge(prometheus.GaugeOpts{
Name: "antrea_agent_runtime_info",
Help: "Antrea agent runtime info , defined as a labels. The value of the gauge is always set to 1.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Help: "Antrea agent runtime info , defined as a labels. The value of the gauge is always set to 1.",
Help: "Antrea agent runtime info , defined as labels. The value of the gauge is always set to 1.",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Antrea agent uses apiserver to interact with antctl. However, unlike the
controller, the API endpoint is exposed only locally which limits antctl
to local execution.
Additionally, we would like to reuse the listener for Prometheus metrics,
which would require external access to the API endpoint's metrics path.

Having authentication in a similar way to the controller's implementation
could help here as it will allow external exposure of the API endpoint.

ovsStats := newOVSStatManager(ovsBridge, ofClient)
if err := prometheus.Register(ovsStats); err != nil {
klog.Error("Failed to register antrea_agent_local_pod_count with Prometheus")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksamoray do you mean antrea_agent_ovs_flow_table here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Thanks

Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.
@ksamoray
Copy link
Contributor Author

/test-all

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor

/test-networkpolicy

@antoninbas antoninbas added this to the Antrea v0.6.0 release milestone Apr 28, 2020
@antoninbas antoninbas merged commit ec17ec0 into antrea-io:master Apr 29, 2020
Dyanngg pushed a commit to Dyanngg/antrea that referenced this pull request Apr 29, 2020
Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.

See antrea-io#236
@ksamoray ksamoray deleted the prometheus-integration branch May 29, 2020 12:49
McCodeman pushed a commit to McCodeman/antrea that referenced this pull request Jun 2, 2020
Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.

See antrea-io#236
McCodeman pushed a commit that referenced this pull request Jun 2, 2020
Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.

See #236
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
Integrate with Prometheus monitoring solution.
Integration of the Prometheus client into Antrea controller and agent
allows the exposure of various metrics to Prometheus server.
In addition to Antrea's own set of metrics, Prometheus client will also
expose metrics which are defined by various components which are part of
the Antrea ecosystem, e.g golang, Prometheus itself etc.

See antrea-io#236
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exporting prometheus metrics for OVS performance and routes