feat(metrics): add experimental otel metrics collector #2988
Conversation
receivers:
{{- if not .Values.sumologic.metrics.collector.otelcol.enabled }}
  - telegraf
{{- else }}
  - prometheus
{{- end }}
Are we going to enable users to add other receivers soon?
Do we have a use case for that? It's assumed right now that everything exposes Prometheus metrics either directly or via a sidecar. Either way, I wouldn't want to add that in this PR.
Yes, I know that it requires some more work for now; I was just wondering what the future of this collector is going to be.
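For reference, the template branch quoted above keys off a single values entry. A minimal sketch of the values.yaml fragment that switches metrics collection to the otelcol collector (the path is taken from the template above; everything else assumed default):

sumologic:
  metrics:
    collector:
      otelcol:
        # Experimental: switches metrics collection from the
        # Telegraf/Prometheus setup to the operator-managed otelcol collector.
        enabled: true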
prometheus:
  enabled: false
prometheusOperator:
  enabled: false
Could both the Prometheus Operator and the OpenTelemetry Operator be enabled by the user?
Yes, they can be. Unless I'm misunderstanding the question.
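To make the combination concrete, a sketch of the values involved, matching the snippet above. The kube-prometheus-stack and opentelemetry-operator subchart keys are assumptions here, not verified against this PR:

kube-prometheus-stack:
  prometheus:
    enabled: false       # drop the Prometheus server itself
  prometheusOperator:
    enabled: false       # the target allocator takes over this role
opentelemetry-operator:
  enabled: true          # required to reconcile the OpenTelemetryCollector CR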
@@ -405,6 +405,20 @@ sumologic:
## Defines metrics metadata enrichment provider - `otelcol` or `fluentd`. `otelcol` is the default and is recommended. `fluentd` is deprecated.
Shouldn't we change this comment to say Prometheus is the default for metrics collection?
No, because it's not the default yet.
@@ -0,0 +1,57 @@
{{- if and (eq (include "metrics.otelcol.enabled" .) "true") .Values.sumologic.metrics.collector.otelcol.enabled }}
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
An OpenTelemetry CR! I guess we included the OpenTelemetry Operator in the chart in an earlier PR.
It's been included for a while actually, though it's disabled by default. It's the only way to do auto-instrumentation.
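As background on that auto-instrumentation use case: the operator drives it through its Instrumentation CR. A minimal sketch, with the name and endpoint as placeholders rather than anything from this chart:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: example-instrumentation   # hypothetical name
spec:
  exporter:
    # placeholder; point this at whatever collector receives the telemetry
    endpoint: http://example-collector:4317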
Could you add a warning informing the user that the OpenTelemetry Operator must be enabled in order to use the otelcol metrics collector? Something similar to the warnings in https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/deploy/helm/sumologic/templates/NOTES.txt
I'd rather not add anything user facing here yet. This is clearly marked as experimental, and it's currently more for our internal testing than anything else. And there is a comment in values.yaml saying that opentelemetry-operator needs to be enabled.
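For illustration only, since the PR deliberately doesn't add this yet: a guard of the kind suggested could fail the install when the combination is invalid. This is a sketch, not the chart's actual code:

{{- if and .Values.sumologic.metrics.collector.otelcol.enabled (not (index .Values "opentelemetry-operator" "enabled")) }}
{{- fail "The otelcol metrics collector requires opentelemetry-operator.enabled=true" }}
{{- end }}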
    requests:
      storage: {{ .Values.metadata.persistence.size }}
{{- end }}
  config: |
From the SIG meeting, I guess the way this configuration is specified could change...
Rather than changing, there are probably going to be other ways to define it.
        # regex: (id|name)
        relabel_configs: *relabel_configs # partially copied from what operator generates
      target_allocator:
        endpoint: http://{{ template "sumologic.metadata.name.metrics.targetallocator.name" . }}
So does the collector use this endpoint to discover the targets to be scraped? Does this get converted to http_sd_configs by the operator under the hood?
This is actually a feature of the prometheus receiver, and it doesn't use http_sd_configs as far as I know. The operator gives us the target allocator service.
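Putting the fragments quoted in this PR together, the receiver-side configuration looks roughly like the sketch below. The scrape_configs list is left empty here on the assumption that the target allocator supplies the targets:

receivers:
  prometheus:
    config:
      scrape_configs: []
    target_allocator:
      endpoint: http://{{ template "sumologic.metadata.name.metrics.targetallocator.name" . }}
      interval: 30s              # how often to re-fetch assigned targets
      collector_id: ${POD_NAME}  # identifies this replica to the allocator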
  mode: statefulset
  replicas: 3
  serviceAccount: {{ template "sumologic.metadata.name.metrics.collector.serviceaccount" . }}
  targetAllocator:
Does the operator create a target allocator deployment for us?
Yes.
  targetAllocator:
    serviceAccount: {{ template "sumologic.metadata.name.metrics.targetallocator.serviceaccount" . }}
    enabled: true
    prometheusCR:
What does enabling the Prometheus CR option under the target allocator do?
It makes the target allocator watch for Service and Pod Monitors and turn them into Prometheus scrape target configuration. Without this option, it only acts on the discovery config defined explicitly in the prometheus receiver config.
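Concretely, that corresponds to this fragment of the OpenTelemetryCollector spec, completing the snippet above (field names per the operator's v1alpha1 API):

spec:
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true   # watch ServiceMonitor/PodMonitor objects and build scrape configs from them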
    {{- include "sumologic.labels.common" . | nindent 4 }}
spec:
  mode: statefulset
  replicas: 3
Maybe it's worth making this configurable, WDYT?
There are a lot of settings that should be made configurable before we move this to a user-facing beta. For this PR, I wanted to limit the complexity, since it's big and complex as-is, so I only exposed the configuration necessary to run the integration tests successfully.
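As an example of the kind of follow-up meant here, exposing the replica count might look like the line below. The values key is hypothetical and not part of this PR:

spec:
  mode: statefulset
  replicas: {{ .Values.sumologic.metrics.collector.otelcol.replicas | default 3 }}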
It can be done in another pull request, but it would be worthwhile to add a diagram and documentation showing the data flow for metrics collection with the otelcol metrics collector. Without documentation, a lot of people could have trouble understanding this, even if it is an experimental feature.
      endpoint: http://{{ template "sumologic.metadata.name.metrics.targetallocator.name" . }}
      interval: 30s
      collector_id: ${POD_NAME}
    extensions:
We should perhaps extract the extension definitions to another file, to prevent duplication. This can be done in another PR.
The extensions aren't the same: the new metrics collector is managed by the operator, which injects the health check extension automatically and takes care of opening the right port and adding the health checks in K8s. They're mostly the same otherwise, though, so there's room to improve this.
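For reference, the hand-written pipeline declares that extension explicitly, along the lines of the sketch below, while the operator injects an equivalent into the CR-managed collector:

extensions:
  health_check: {}   # serves a liveness/readiness endpoint (port 13133 by default)
service:
  extensions:
    - health_check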
Awesome! 👏 Let's run this 🚀
Experimental otel metrics collector to replace Prometheus. This is done by using the OpenTelemetry Operator with the target allocator and Prometheus CR options enabled. This causes the target allocator to take on the role of the Prometheus Operator: tracking Service and Pod Monitors and translating them into raw Prometheus configuration. The target allocator also allocates the scrape targets to otel Pods, so this setup is scalable.
This collector replaces the metrics metadata StatefulSet - it's essentially the same pipeline, but with a different receiver.
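A rough sketch of the resulting data flow, based on the description above:

ServiceMonitors / PodMonitors
  -> target allocator (watches the CRs and assigns scrape targets)
  -> otelcol collector Pods (prometheus receiver scrapes the assigned targets, then the usual metadata pipeline runs)
  -> Sumo Logic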
Intentionally missing in this PR:
TODO: