Supportability: print own metrics in logs #2098
Comments
Resolves open-telemetry#2098

The Collector's own metrics are an important source of information for troubleshooting. Typically, own metrics are scraped using a Prometheus receiver and sent to a backend where they can be examined. However, this only works if the Collector's metrics pipeline works and the backend is available. If this is not the case, which often happens when the Collector is misconfigured and cannot send metrics or the backend is not available, then these metrics cannot be seen anywhere.

In an effort to improve supportability of the Collector in such situations we want to output own metrics in a log file. In difficult situations the local Collector log is an important source of troubleshooting information. We periodically log the metric values in a human-readable form.

When the --log-level=debug command line option is passed, the metrics are logged as often as they are exported by stats (currently every 10 seconds). If --log-level is not set to debug, the metrics are logged once every 5 minutes.

Sample output:

```
2020-11-10T09:59:34.884-0500	INFO	service/telemetry_log.go:234	Internal Metrics:
Metric | Value
--------------------------------------------------|--------------------------------
exporter/send_failed_log_records |
exporter/sent_log_records |
fluent_closed_connections |
fluent_events_parsed |
fluent_opened_connections |
fluent_parse_failures |
fluent_records_generated |
grpc.io/client/completed_rpcs |
grpc.io/client/received_bytes_per_rpc |
grpc.io/client/received_messages_per_rpc |
grpc.io/client/roundtrip_latency |
grpc.io/client/sent_bytes_per_rpc |
grpc.io/client/sent_messages_per_rpc |
grpc.io/server/completed_rpcs |
grpc.io/server/received_bytes_per_rpc |
grpc.io/server/received_messages_per_rpc |
grpc.io/server/sent_bytes_per_rpc |
grpc.io/server/sent_messages_per_rpc |
grpc.io/server/server_latency |
kafka_receiver_current_offset |
kafka_receiver_messages |
kafka_receiver_offset_lag |
kafka_receiver_partition_close |
kafka_receiver_partition_start |
process/cpu_seconds | 0
process/memory/rss | 44,625,920
process/runtime/heap_alloc_bytes | 13,168,120 By
process/runtime/total_alloc_bytes | 28,170,760 By
process/runtime/total_sys_memory_bytes | 76,366,848 By
process/uptime | 55.006789 s
processor/accepted_log_records |
processor/accepted_metric_points |
processor/accepted_spans |
processor/batch/batch_send_size_bytes |
processor/batch/batch_size_trigger_send |
processor/batches_received |
processor/dropped_log_records |
processor/dropped_metric_points |
processor/dropped_spans |
processor/queued_retry/fail_send |
processor/queued_retry/queue_latency |
processor/queued_retry/queue_length |
processor/queued_retry/send_latency |
processor/queued_retry/success_send |
processor/refused_log_records |
processor/refused_metric_points |
processor/refused_spans |
processor/spans_dropped |
processor/spans_received |
processor/trace_batches_dropped |
receiver/accepted_log_records |
receiver/refused_log_records |
scraper/errored_metric_points |
scraper/scraped_metric_points |
--------------------------------------------------|--------------------------------
Component/Dimensions | Metric | Value
--------------------------------------------------|-----------------------------------------|--------------------------------
exporter=otlphttp | exporter/send_failed_metric_points | 57
exporter=otlphttp | exporter/send_failed_spans | 1,085
exporter=otlphttp | exporter/sent_metric_points | 0
exporter=otlphttp | exporter/sent_spans | 0
processor=batch | processor/batch/batch_send_size | 2/33.439024/80 (min/mean/max) Occurrences=41
processor=batch | processor/batch/timeout_trigger_send | 41
receiver=jaeger, transport=collector_http | receiver/accepted_spans | 1,306
receiver=jaeger, transport=collector_http | receiver/refused_spans | 0
receiver=prometheus, transport=http | receiver/accepted_metric_points | 77
receiver=prometheus, transport=http | receiver/refused_metric_points | 0
--------------------------------------------------|-----------------------------------------|--------------------------------
```
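For illustration only, here is a minimal sketch (not the actual PR code) of the periodic-logging loop described above, written in Go with zap, the logging library the Collector uses. The interval choice mirrors the description (10 seconds at debug level, 5 minutes otherwise); collectMetricsTable is a hypothetical helper standing in for whatever renders the table shown in the sample output:

```
package main

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// collectMetricsTable is a hypothetical helper that would render the current
// values of the Collector's own metrics as a human-readable table.
func collectMetricsTable() string {
	// ... gather the current metric values and format them ...
	return ""
}

// logMetricsPeriodically logs the Collector's own metrics until done is closed.
// At debug level it logs on every stats export interval (10s here); otherwise
// it logs once every 5 minutes, matching the behavior described above.
func logMetricsPeriodically(logger *zap.Logger, level zapcore.Level, done <-chan struct{}) {
	interval := 5 * time.Minute
	if level == zapcore.DebugLevel {
		interval = 10 * time.Second
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			logger.Info("Internal Metrics:\n" + collectMetricsTable())
		case <-done:
			return
		}
	}
}

func main() {
	logger, _ := zap.NewDevelopment()
	done := make(chan struct{})
	go logMetricsPeriodically(logger, zapcore.InfoLevel, done)
	time.Sleep(time.Minute) // let it run briefly for demonstration
	close(done)
}
```

In a real implementation the collecting side would read the Collector's registered metric views; that part is deliberately left abstract here.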
A workaround is for the user to go to the metrics page exposed by Prometheus. Why are logs better for this?
Metric values don't exist in Prometheus unless you scrape and collect them. Scraping metrics requires a working metrics collection system, which may not exist. The Collector is an essential part of the telemetry collection pipeline, and its troubleshooting cannot rely solely on a functioning collection system. You need to be able to troubleshoot the Collector even if all of your telemetry collection is broken. Logs are self-contained and enable troubleshooting in broken situations.
If you are referring to the Prometheus metrics page exposed by the Collector itself, then I already addressed that in the PR: the Prometheus page is only available when the Collector is running, which is fine for live debugging. Logs outlive the Collector process and can be used for postmortem troubleshooting.
@bogdandrutu any thoughts on this?
I believe that having a working metrics infra is a requirement for whatever use case that involves understanding outages at a later stage. People are already expected to have this in place and we shouldn't deviate from that: adding one more source of information would be confusing, IMO. Exposing an OpenMetrics or Prometheus compatible endpoint at
Coincidentally, we can already see how annoying it is to have something like this, by using the latest OpenTelemetry Collector with the default configuration. See #2212.
I only partially agree. The output in #2212 is not very readable and thus not very useful. When done nicely, this can be useful. The output of #2103 is more human-readable and less noisy, and is only printed once every 5 minutes. It is definitely not as annoying as #2212.
Only once every 5 minutes is still quite noisy when you have a long-running service with little to no error/warning/info-level messages. To me, it still sounds like something that would be useful during the initial setup of the infra, not for production usage.
@jpkrohling I agree with you. However, I believe the initial setup is a very important use case. Users often struggle with the setup, and when things don't work and the Collector is unable to send any data, we currently don't offer much help to users. I think we need to do more to make the first-time experience nicer. This is part of that first-time experience.
I fail to see how one log record (although a large one) creates such an amount of noise that it becomes a problem. Can you explain what the problem is: why is one informational log record every 5 minutes a problem? Are you suggesting that we can log informational messages but once every 5 minutes is too often, or that we should never log any informational messages at all?
First-time experience != production experience. I agree that the Collector doesn't offer much help at the moment, especially when it swallows errors, but printing metrics to the logs isn't the way to go for the production case, IMO. In production, people already (should) have a metrics backend, such as Prometheus. Even when they don't, it's common practice to just do a cURL to
If this one log record is in a single line, then I can easily grep it out and focus on what I'm trying to find. I wouldn't like this line, but I wouldn't oppose having one. But I can't easily grep those lines out, and I'll end up fighting with the logs when I need them most. From what I understand, here's what I would see periodically:
Info-level messages are perfectly fine if they help me understand what's going on when I need to read the logs. In this case, there isn't a specific reason the metrics are printed out (no problem was detected, no event was triggered, nothing really happened), so they aren't pointing me in a specific direction. Worse, they might be distracting me and making it harder to find the log entries that will help me understand what's going on.
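As a rough, self-contained illustration of the "cURL the metrics endpoint" approach mentioned a couple of comments above, the following Go sketch fetches the Collector's own Prometheus metrics. It assumes the commonly used default self-telemetry address localhost:8888/metrics, which may differ in a given deployment:

```
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Assumes the default self-telemetry address; adjust if the Collector
	// is configured to expose its own metrics elsewhere.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		log.Fatalf("failed to fetch Collector metrics: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("failed to read response: %v", err)
	}
	fmt.Print(string(body)) // Prometheus text exposition format
}
```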
@jpkrohling This is a good approach when the Collector is running. However, the use case I am trying to tackle here is when the Collector is dead and you are doing postmortem analysis.
I have seen this done successfully with several proprietary products I worked on in the past, and such logs were very useful for troubleshooting customer problems. None of those are open source, so I can't give any public links. Just to clarify: I don't necessarily insist that this is the only right solution. I am happy to see other proposals that solve the problem. Let me restate the problem: I want to be able to troubleshoot a Collector that I am not running myself, that someone else is running, and that person is not an expert in the Collector. It is very likely that I am going to learn about the problem after it has already happened, when the Collector may have OOMed, crashed, or restarted. This is a scenario where I don't have direct access to the environment where the Collector runs. It is very likely that in such a situation the best I can get are the logs from the previous run. We have also chatted with @bogdandrutu about this. He offered to dump all useful information to the logs when the Collector crashes, OOMs, or panics. This is a possible approach, but it is not guaranteed to work since, e.g., in an OOM situation you may not be able to do much work. @jpkrohling I am open to any other suggestions you may have to help with this particular troubleshooting scenario. But please make sure you think about how you will troubleshoot problems that happen in a customer environment to which you have no access.
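The crash-time alternative mentioned in the comment above (dumping useful state when the Collector panics) could look roughly like the sketch below. snapshotMetrics is a hypothetical helper, and, as noted, a hard OOM kill gives the process no chance to run deferred code at all:

```
package main

import "go.uber.org/zap"

// snapshotMetrics is a hypothetical helper returning the Collector's own
// metrics rendered as a human-readable string.
func snapshotMetrics() string { return "" }

// dumpOnPanic logs a final metrics snapshot if the calling goroutine panics,
// then re-panics so the process still fails loudly.
func dumpOnPanic(logger *zap.Logger) {
	if r := recover(); r != nil {
		logger.Error("Collector panicked, dumping internal metrics",
			zap.Any("panic", r),
			zap.String("metrics", snapshotMetrics()))
		panic(r)
	}
}

func main() {
	logger, _ := zap.NewDevelopment()
	// Usage: defer dumpOnPanic(logger) at the top of each goroutine to protect.
	defer dumpOnPanic(logger)
	panic("simulated crash") // triggers the metrics dump before the process dies
}
```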
That's another good reason to have an observability stack, or at least a metrics backend such as Prometheus: it can survive outages of the OpenTelemetry Collector. And better than logs, it can survive even when the instance/pod simply disappears/dies, which would make logs ineffective as well ;-) As a bonus, people can easily see the evolution of the metrics in a graph, instead of trying to understand their evolution by reading log entries.
This is actually very similar to what we have in Jaeger, both in the community and as a supported offering. My experience tells me that most of the problems people have are setup, configuration, or environment issues. Reproducing those issues is typically straightforward once we have the configuration files used and enough information about the environment, such as the versions of the components involved, the storage mechanism being used, and so on. Metrics wouldn't help in those cases. A variation of this is when users are confused about a specific behavior, and turning on debug logs can explain what's going on. Then there are the cases that only appear under very specific conditions in terms of load. So far, I can remember only one issue in Jaeger that we couldn't solve with logs + configuration files: jaegertracing/jaeger#2452 (and a few other issues that seem related). In that case, metrics did help in showing that there was a problem, but didn't help in understanding the root cause. I'm skeptical about this proposal because we'd be imposing a high tax on all production instances of the OpenTelemetry Collector for all users out there, only because we think metrics in the logs might help in a few specific cases for an even smaller subset of users. IMO, it's not unreasonable to ask users to turn on debug-level logs if they are experiencing an issue consistently.
My suggestion won't help people who don't have a monitoring or observability stack in place, but I trust that users of the OpenTelemetry Collector know the value that observability tools provide. I would really start with the premise that our users do monitor their production workloads with something that is able to scrape our metrics endpoint. Given this premise, my suggestions are:
@jpkrohling good suggestions, thank you. Would you want to work on any of these yourself?
I can commit to working on the first and second points for the OTLP and Jaeger receivers and exporters, as well as for the processors I own. About the timing: I can try to schedule this for my next sprint, which starts in Calendar Week 51, but it's a very short sprint due to the holidays. In the worst case, I can get this done by Calendar Week 2 next year (Jan 15).
I opened four issues, each tracking a suggestion I've made. I'll start working on them in the given order.
The Collector's own metrics are an important source of information for troubleshooting. Typically, own metrics are scraped using a Prometheus receiver and sent to a backend where they can be examined. However, this only works if the Collector's metrics pipeline works and the backend is available. If this is not the case, which often happens when the Collector is misconfigured and cannot send metrics or the backend is not available, then these metrics cannot be seen anywhere.
In an effort to improve the supportability of the Collector in such situations, we want to output own metrics in a log file. In difficult situations the local Collector log is an important source of troubleshooting information. We want to periodically log the metric values in a human-readable form.
Note: we need to be careful to not flood the logs.