Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work needed to enable useOtelForInternalMetrics by default #7454

Closed
1 task done
codeboten opened this issue Mar 29, 2023 · 16 comments · Fixed by #9037
Closed
1 task done

Work needed to enable useOtelForInternalMetrics by default #7454

codeboten opened this issue Mar 29, 2023 · 16 comments · Fixed by #9037

Comments

@codeboten
Copy link
Contributor

codeboten commented Mar 29, 2023

This issue is to discuss the work needed to enable using the OTel SDK by default.

Related issues:

@jmacd
Copy link
Contributor

jmacd commented Mar 29, 2023

I've experimented with using metrics generated by an OTel collector running with the above-mentioned flag. I'm picking up work started by @paivagustavo. This is an experience report:

I modified this example to configure the Prometheus receiver inside a collector for self observability like:

  prometheus:
    config:
      scrape_configs:
      - job_name: self
        scrape_interval: 5s
        static_configs:
        - targets:
          - 127.0.0.1:8888
// ...
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [logging, otlp/arrow]
  telemetry:
    resource:
      "service.name": "example-bridge"
    metrics:
      address: 127.0.0.1:8888
      level: detailed
    logs:
      level: info

which is similar to the example configuration given in that receiver's README.

To address the problem in #6403, I use --feature-gates=telemetry.useOtelForInternalMetrics,pkg.translator.prometheus.NormalizeName flags together.

The result, viewed by Prometheus has a mix of metrics produced from OC and OTel. This is an example from the OC path:

# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="otlp/arrow",service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 0

See that service_name is the configured service.name, service_version was set via the builder and service_instance_id was synthesized while setting up telemetry.

Here is an example from the OTel instrumentation with the associated target info.

# HELP otelcol_exporter_sent_wire_total Number of bytes sent on the wire by the component.
# TYPE otelcol_exporter_sent_wire_total counter
otelcol_exporter_sent_wire_total{exporter="otlp/arrow"} 69203

# HELP otelcol_target_info Target metadata
# TYPE otelcol_target_info gauge
otelcol_target_info{service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 1

The same metrics are also pushed via OTLP through an example pipeline, which I've printed. The resource comes through like:

  resource: {
    attributes: {
      key: "service.name"
      value: {
        string_value: "self"
      }
    }
    attributes: {
      key: "service.instance.id"
      value: {
        string_value: "127.0.0.1:8888"
      }
    }
    attributes: {
      key: "net.host.port"
      value: {
        string_value: "8888"
      }
    }
    attributes: {
      key: "http.scheme"
      value: {
        string_value: "http"
      }
    }
  }

Note that the original service name has been lost, overridden by the Prometheus scrape config's job name. The original service instance ID is also lost. The net.host.port has been inserted, but I'm not sure it should be in this case, because I'd prefer not to have lost the original instance ID. The http.scheme is surely not relevant as a resource, here.

I can't explain where the service.version attribute has gone -- it should have come through as a resource attribute.

Next, I experimented with whether you can fix some of these issues using Prometheus relabeling rules, e.g., using

        - regex: service_version
          action: labeldrop
        - regex: __scheme__
          action: labeldrop
        - source_labels: [service_instance_id]
          action: replace
          target_label: instance
        - source_labels: [service_name]
          action: replace
          target_label: job

but this doesn't help much -- because anything you do at this step is before the logic that converts Prom data back to OTLP (which is where the problems arise).

Here's what the OC metric looks like after OTLP conversion:

    metrics: {
      name: "otelcol_exporter_enqueue_failed_log_records"
      description: "Number of log records failed to be added to the sending queue."
      sum: {
        data_points: {
          attributes: {
            key: "exporter"
            value: {
              string_value: "logging"
            }
          }
          attributes: {
            key: "service_name"
            value: {
              string_value: "example-bridge"
            }
          }
          attributes: {
            key: "service_version"
            value: {
              string_value: "0.0.1-dev"
            }
          }
          attributes: {
            key: "service_instance_id"
            value: {
              string_value: "955ece9f-2577-410e-ab0a-8ea042c6de5d"
            }
          }
          start_time_unix_nano: 1680124224552000000
          time_unix_nano: 1680124584556000000
          as_double: 0
        }

It appears that there are several issues, not all of them need to be addressed.

  1. The Prometheus-to-OTLP conversion is leaving me a bit unhappy. I'd like a way for that conversion to honor the existing OTel conventions, somehow. Something like honor_labels: true in the Prometheus configuration, but for honoring OTel attributes. So, instead of honoring job and instance, I want to honor service.name, service.instance.id, and service.version from the scraped data (i.e., from the target_into). Possibly that is the intention behind pkg.translator.prometheus.PermissiveLabelSanitization.
  2. Somehow the target_into metric is not being respected, otherwise we'd see a service.version attribute in the printout
  3. The OpenCensus-generated metrics have unwanted service_name, service_instance_id, and service_version attributes which should be disabled when OTel SDKs are being used

@codeboten
Copy link
Contributor Author

codeboten commented Apr 3, 2023

@jmacd I believe this PR will help w/ the third point.

The issue is that currently both OC and OT are using prometheus as a bridge to allow components to continue using opencensus instrumentation. The change I'm proposing disables the OC configuration when useOtelForInternalMetrics is enabled and relies on the oc to otel bridge instead.

@mx-psi
Copy link
Member

mx-psi commented Apr 12, 2023

We may need to address #7517 before closing this

@mx-psi
Copy link
Member

mx-psi commented Aug 11, 2023

The check_collector_pipeline functionality in the health check extension depends on OpenCensus, see here. We would need to either deprecate this functionality or update it to use OTel before enabling useOtelForInternalMetrics

codeboten pushed a commit to codeboten/opentelemetry-collector that referenced this issue Nov 20, 2023
This ensures backwards compatibility with the metrics generated via opencensus by default today.

Part of open-telemetry#7454

Signed-off-by: Alex Boten <[email protected]>
mx-psi pushed a commit that referenced this issue Nov 21, 2023
This ensures backwards compatibility with the metrics generated via
opencensus by default today.

Part of #7454

Signed-off-by: Alex Boten <[email protected]>
@codeboten
Copy link
Contributor Author

With the PR for _total suffix merging yesterday, I started looking at the second issue @jmacd mentioned in his comment above. With opencensus telemetry, service labels are applied on each metric:

otelcol_exporter_queue_capacity{exporter="otlp",service_instance_id="aae63e2f-cad2-4224-bbb3-806689b0a2d0",service_name="otelcontribcol",service_version="0.89.0-dev"} 1000

With otelForInternalMetrics enabled, all that information is set in the target_info:

otelcol_target_info{service_instance_id="73a547db-f135-45c4-870e-cde704b3cf7d",service_name="otelcontribcol",service_version="0.89.0-dev"} 1

I'm not sure that there's a good mechanism today for applying info from target_info as resource attributes. @dashpole pointed me to a spec change proposed to address this for prometheus exporters open-telemetry/opentelemetry-specification#3761 (thanks David!)

I don't know what the right way to move forward with this change is with the current state. I did test the OTLP exporter for internal metrics and the resource attributes contained the service information I expected. The following is the output from the Go stdout exporter:

  "Resource": [
    {
      "Key": "service.instance.id",
      "Value": {
        "Type": "STRING",
        "Value": "cdc7786f-fcfc-4530-98cd-7278e6dbcf5c"
      }
    },
    {
      "Key": "service.name",
      "Value": {
        "Type": "STRING",
        "Value": "otelcontribcol"
      }
    },
    {
      "Key": "service.version",
      "Value": {
        "Type": "STRING",
        "Value": "0.89.0-dev"
      }
    }
  ],

In other words, for end users that want OTLP for their collector metrics (like @jmacd's original scenario), this would be a good replacement for what is only available today w/ prometheus metrics (with OC). For users that want to continue using the Prometheus export functionality, I'm not sure what to recommend.

@dashpole
Copy link
Contributor

Do you know why otelcol_target_info has the otelcol prefix? We would really like to keep the name unmodified, if possible.

@codeboten
Copy link
Contributor Author

Do you know why otelcol_target_info has the otelcol prefix? We would really like to keep the name unmodified, if possible.

It's prefixed here https://github.com/open-telemetry/opentelemetry-collector/blob/445960b82fb4f4a979bfa5e6ce88aa754d22b92d/service/internal/proctelemetry/config.go#L206C1-L206C84

@dashpole
Copy link
Contributor

I think we can wrap Collect to add resource attributes as additional labels. That is what the prometheus exporter implements. In Collect, we would need to wrap each Metric, and override the Write function. There, we would need to add our additional labels.

It would end up looking similar to https://github.com/prometheus/client_golang/blob/80d3f0b5b36c891016370423516e353b1a5f0a55/prometheus/metric.go#L172, which wraps another metric and adds exemplars to it.

@jpkrohling
Copy link
Member

During the SIG call, I mentioned that Java implemented a double-emit so that an application could use both a new and an old semantic convention at the same time, to make it easier to migrate between versions without breaking other parts of the observability pipeline. That feature is described here: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/README.md

@codeboten
Copy link
Contributor Author

@jpkrohling I guess this would raise the question of what's the correct thing to emit since there are a couple of differences between the current metrics and what's are the correct metrics:

  1. _total suffix should be emitted
  2. target_info should be present and new metrics should not include resource labels

Are you proposing that we have two endpoints that emit these metrics at the same time? I'm not sure what the deprecation story would be here, since it would be hard for users to know this change is coming. We saw changing metrics surprise users when we tried to change the default behaviour of counter metrics to include the _total suffix in the past.

Maybe the solution is to have a configuration option to use the legacy format?

@codeboten
Copy link
Contributor Author

@dashpole I followed up on your question about the prefix, and made the change: #8988

I wanted to answer another question from the SIG meeting which was whether the prometheus receiver was able to use this target_info. Before the change, it would treat otelcol_target_info as any other metric. After #8988, the prometheus receiver to use target_info and add its contents in the emitted telemetry. The following is the resource attributes output of a prometheus receiver scraping the otel collector internal metrics and emitting them as a debug exporter:

2023-11-23T12:04:37.361-0800	info	ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
     -> service.name: Str(self)
     -> service.instance.id: Str(127.0.0.1:8889)
     -> net.host.port: Str(8889)
     -> http.scheme: Str(http)
     -> service_instance_id: Str(59dee129-67c4-4981-b077-e4e04aa3e4e6)
     -> service_name: Str(otelcorecol)
     -> service_version: Str(0.89.0-dev)

One more question that ^^^ raises is whether or not this is correct. As you can see there is both a service.name and a service_name attribute which are different, service.name is the prometheus job name configured in the receiver, service_name is the collector's target_info details.

@jpkrohling
Copy link
Member

target_info should be present and new metrics should not include resource labels
...
Maybe the solution is to have a configuration option to use the legacy format?

Perhaps a feature flag can control the three possible scenarios? No target info + all resource attributes on metrics, target info and minimal set of resource attributes on metrics, target info and all resource attributes.

The third scenario would help people migrate from the previous to the new one, as they can just ignore the extra resource attributes when building the integration. Once they are ready, they can move to the scenario 2, which is what will be supported for the long term.

@dashpole
Copy link
Contributor

One more question that ^^^ raises is whether or not this is correct.

This is correct. Unfortunately, until prom supports utf-8, we can't auto-translate service_name -> service.name. I would recommend using an additional processor to overwrite service.name with service_name, etc.

@codeboten
Copy link
Contributor Author

target_info should be present and new metrics should not include resource labels
...
Maybe the solution is to have a configuration option to use the legacy format?

Perhaps a feature flag can control the three possible scenarios? No target info + all resource attributes on metrics, target info and minimal set of resource attributes on metrics, target info and all resource attributes.

The third scenario would help people migrate from the previous to the new one, as they can just ignore the extra resource attributes when building the integration. Once they are ready, they can move to the scenario 2, which is what will be supported for the long term.

Are you suggesting a single gate that could be used for all 3 scenarios? If so then I would recommend we:

  1. wait until we can produce metrics that are identical using OTel,
  2. enable otel by default and deprecate this feature gate
  3. create a new gate that allows users to change the output (open to ideas for name here)
  4. enable this new gate as the default some time in the future

In other words, I would like to decouple people having to change the metrics they're using from enabling otel. That being said, it's a bit weird to enable otel with the old style metrics, but at least it unblocks that work

@codeboten
Copy link
Contributor Author

@dashpole with open-telemetry/opentelemetry-specification#3761 close to merging, do you have any thoughts on how soon this could be adopted in the otel-go prometheus exporter? i would like to avoid adding workarounds in the collector code if possible.

@dashpole
Copy link
Contributor

I would guess a week or two? I don't think it will be hard to implement.

pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023
…etry#8965)

This ensures backwards compatibility with the metrics generated via
opencensus by default today.

Part of open-telemetry#7454

Signed-off-by: Alex Boten <[email protected]>
codeboten pushed a commit that referenced this issue Dec 12, 2023
The metrics are now consistent with the metrics produced by OpenCensus.
We should move the featuregate forward.

Note that the OpenTelemetry generated metrics includes grpc
client/server metrics (for receivers/exporters that use grpc) and
`target_info` metrics

Fixes
#7454

---------

Signed-off-by: Alex Boten <[email protected]>
sokoide pushed a commit to sokoide/opentelemetry-collector that referenced this issue Dec 18, 2023
The metrics are now consistent with the metrics produced by OpenCensus.
We should move the featuregate forward.

Note that the OpenTelemetry generated metrics includes grpc
client/server metrics (for receivers/exporters that use grpc) and
`target_info` metrics

Fixes
open-telemetry#7454

---------

Signed-off-by: Alex Boten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants