
fix(logging): Update otel and tracing config #5291

Merged · 4 commits · Feb 10, 2024

Conversation

@lc525 (Member) commented Feb 5, 2024

What this PR does / why we need it:

  • fixes dataflow engine logging errors by specifying the otel exporter protocol

The updates are needed because the opentelemetry-java-instrumentation library requires an http(s) URI for the otel collector endpoint, regardless of the actual protocol. Since our default is grpc, we explicitly set the OTEL_EXPORTER_OTLP_PROTOCOL environment variable on dataflow pods (a rough sketch of this wiring follows the list below).

  • fixes the otel-collector config to remove the deprecated Jaeger exporter (Jaeger now supports OTLP ingestion directly).

  • add the OTEL_EXPORTER_OTLP_PROTOCOL key to the seldon-tracing ConfigMap and update the operator, CRDs and helm charts to support getting the value for this key from tracing config, similarly to how OTEL_EXPORTER_OTLP_ENDPOINT is fetched

  • update the versions used by ansible for Jaeger and the opentelemetry-operator

  • port 14250 (Jaeger's dedicated gRPC collector port) no longer needs to be exposed under any config

  • fix dependency ordering for dataflow/gradle.

    • the previous ordering prevented kafka-streams from finding the slf4j logging provider
    • this led to logs produced by kafka-streams not being recorded
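As a rough sketch of the ConfigMap key and pod-level wiring described above (the resource names, image, and endpoint value below are assumptions for illustration, not the exact manifests changed in this PR):

```yaml
# Illustrative only: names and values are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-tracing
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://seldon-collector:4317"
  OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"   # grpc despite the http(s)-style endpoint URI
---
# Sketch of the corresponding environment variable on a dataflow-engine container.
apiVersion: v1
kind: Pod
metadata:
  name: dataflow-engine-example
spec:
  containers:
    - name: dataflow-engine
      image: example/dataflow-engine:latest
      env:
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          valueFrom:
            configMapKeyRef:
              name: seldon-tracing
              key: OTEL_EXPORTER_OTLP_PROTOCOL
```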


Which issue(s) this PR fixes:

Fixes #
Internal issue references:

  • #INFRA-568 Jaeger latest is crashing
  • #INFRA-464 Otel is not able to parse its config (deprecated exporters)

Public issues:

Special notes for your reviewer:

  • helm chart updates were generated via the Makefile; in order to keep results consistent with the previous helm charts (i.e. so that existing descriptions don't disappear), I had to add some docs and update the Makefile; those changes are unrelated to the primary goal of the PR.
  • k8s/yamls/* - auto-generated via the Makefile

@@ -55,4 +55,5 @@ spec:
   tracingConfig:
     disable: {{ .Values.config.tracingConfig.disable }}
     otelExporterEndpoint: {{ .Values.config.tracingConfig.otelExporterEndpoint }}
+    otelExporterProtocol: {{ .Values.config.tracingConfig.otelExporterProtocol }}
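For reference, the values.yaml entries these template lines read from would look roughly like the following; the key names mirror the template above, while the endpoint value is just an assumption:

```yaml
# Hypothetical values.yaml fragment; only the key structure is implied by the template above.
config:
  tracingConfig:
    disable: false
    otelExporterEndpoint: "http://seldon-collector.seldon-mesh:4317"
    otelExporterProtocol: "grpc"
```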
lc525 (Member, Author) commented:
@RafalSkolasinski following this change, should we update any of the helm chart versions? I don't know how helm chart version updates are handled, or if there is a policy on updating them. Atm, for example, I've seen that the appVersion has nothing to do with the Seldon Core version. Happy to follow what you suggest here.

lc525 (Member, Author) commented:
The descriptions I've added here were in the current helm-charts but not in the sources that generate them via kustomize (i.e. this file). I've added them because otherwise those descriptions would have disappeared on re-generation via the Makefile.

@@ -9,14 +9,14 @@ exporters:
     const_labels:
       provider: seldon

-  logging:
+  debug:
lc525 (Member, Author) commented:
the logging exporter is deprecated and has been replaced by debug
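For context, a partial otel-collector configuration sketch with the debug exporter in place of the deprecated logging one; the Jaeger endpoint and the pipeline layout below are assumptions, not the exact config in this repo:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  debug:           # replaces the deprecated "logging" exporter
    verbosity: basic
  otlp/jaeger:     # Jaeger accepts OTLP directly, so no dedicated jaeger exporter is needed
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp/jaeger]
```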

@sakoush (Member) left a comment:
lgtm. left minor comments / questions.

k8s/Makefile (outdated)
@@ -25,7 +25,7 @@ create-helm-charts:

 .PHONY: create-yaml
 create-yaml:
-	helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs | grep -v "namespace:" > yaml/certs.yaml
+	helm template -n seldon-mesh seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs | grep -v "namespace:" > yaml/certs.yaml
sakoush (Member) commented:
do we want to fix it to seldon-mesh?

@lc525 (Member, Author) commented Feb 5, 2024:
I don't know. However, the certs.yaml file currently generated from the helm chart has seldon-mesh, so not adding the namespace here would have changed lots of dnsNames from seldon-mesh to default on re-generation. I've simply added this here to avoid any changes to the generated certs.yaml.

Perhaps @RafalSkolasinski could suggest a better way of dealing with this (i.e. even if it involves changing the current certs.yaml)?

sakoush (Member) commented:
Note that the grep -v command excludes the namespace from the CRDs in an attempt to make them generic. However, I have not tested a deployment with these yaml manifests in any other namespace. I had not realised before that we have things such as DNS settings with seldon-mesh hardcoded inside.

@lc525 (Member, Author) commented Feb 5, 2024:
Yep, I saw that; it's just that in the case of certs there's never any "namespace:" reference, but the template contains lots of {{ .Release.Namespace }} references within the dnsNames lists, and those get replaced with either default or whatever gets passed to helm template -n when generating the yamls.

Technically, anybody could generate a valid certs.yaml for their own namespace name by running helm template with the right -n [namespace]. Not sure if there's a way of making the yaml generic here. Also, this is hardcoded only at the resulting yaml level, not in terms of the helm chart.
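For illustration, a fragment of the kind of cert-manager Certificate template being discussed; the resource names and dnsNames below are assumptions rather than lines copied from the seldon-core-v2-certs chart:

```yaml
# Hypothetical helm template fragment: the namespace is baked into dnsNames at render time.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: seldon-server-cert            # assumed name
spec:
  secretName: seldon-server-cert
  dnsNames:
    # rendered to "default" or whatever is passed via `helm template -n <namespace>`
    - "seldon-scheduler.{{ .Release.Namespace }}"
    - "seldon-scheduler.{{ .Release.Namespace }}.svc.cluster.local"
  issuerRef:
    name: seldon-issuer               # assumed issuer
    kind: Issuer
```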

sakoush (Member) commented:
Ok. I wonder then if we should fix the namespace in all yaml manifests?

lc525 (Member, Author) commented:
So, I've checked. In all the other ones {{ .Release.Namespace }} only appears as a value of the namespace key, which is removed by the grep -v, so the resulting yamls do not end up containing any hardcoded values. In that sense they are generic.

lc525 (Member, Author) commented:
file generated from helm charts via Makefile (make create-yaml)

lc525 added 4 commits February 9, 2024 23:14
- fixes dataflow engine logging errors by specifying the otel exporter protocol

  this is required because the `opentelemetry-java-instrumentation` library
  requires an http(s) URI for the otel collector endpoint, regardless of the
  actual protocol. Since our default is grpc, explicitly set the
  OTEL_EXPORTER_OTLP_PROTOCOL environment variable on dataflow pods

- add the OTEL_EXPORTER_OTLP_PROTOCOL key to the seldon-tracing configMap
  and update the operator, crds and helm charts to support getting the
  value for this key from tracing config, similarly to how
  OTEL_EXPORTER_OTLP_ENDPOINT is fetched

- update configs for when running outside k8s to match those in k8s
The order in which dependencies are specified for Gradle matters, as it
influences the ordering of the application's classpath.

Here, the slf4j klogging plugin was introduced after the kafka-streams
dependency, leading to errors at runtime, with the kafka-streams logging
facility unable to find an slf4j provider.
kustomize bases and crds now obtained via running `make manifests` in the
operator dir