Supportability: print own metrics in logs #2098
Comments
Resolves open-telemetry#2098

The Collector's own metrics are an important source of information for troubleshooting. Typically, own metrics are scraped using a Prometheus receiver and sent to a backend where they can be examined. However, this only works if the Collector's metrics pipeline works and the backend is available. If this is not the case, which often happens when the Collector is misconfigured and cannot send metrics or the backend is not available, then these metrics cannot be seen anywhere.

In an effort to improve supportability of the Collector in such situations we want to output own metrics in a log file. In difficult situations the local Collector log is an important source of troubleshooting information. We periodically log the metric values in a human-readable form.

When the --log-level=debug command line option is passed, the metrics are logged as often as they are exported by stats (currently every 10 seconds). If --log-level is not set to debug, the metrics are logged once every 5 minutes.

Sample output:

```
2020-11-10T09:59:34.884-0500	INFO	service/telemetry_log.go:234	Internal Metrics:
Metric | Value
--------------------------------------------------|--------------------------------
exporter/send_failed_log_records |
exporter/sent_log_records |
fluent_closed_connections |
fluent_events_parsed |
fluent_opened_connections |
fluent_parse_failures |
fluent_records_generated |
grpc.io/client/completed_rpcs |
grpc.io/client/received_bytes_per_rpc |
grpc.io/client/received_messages_per_rpc |
grpc.io/client/roundtrip_latency |
grpc.io/client/sent_bytes_per_rpc |
grpc.io/client/sent_messages_per_rpc |
grpc.io/server/completed_rpcs |
grpc.io/server/received_bytes_per_rpc |
grpc.io/server/received_messages_per_rpc |
grpc.io/server/sent_bytes_per_rpc |
grpc.io/server/sent_messages_per_rpc |
grpc.io/server/server_latency |
kafka_receiver_current_offset |
kafka_receiver_messages |
kafka_receiver_offset_lag |
kafka_receiver_partition_close |
kafka_receiver_partition_start |
process/cpu_seconds | 0
process/memory/rss | 44,625,920
process/runtime/heap_alloc_bytes | 13,168,120 By
process/runtime/total_alloc_bytes | 28,170,760 By
process/runtime/total_sys_memory_bytes | 76,366,848 By
process/uptime | 55.006789 s
processor/accepted_log_records |
processor/accepted_metric_points |
processor/accepted_spans |
processor/batch/batch_send_size_bytes |
processor/batch/batch_size_trigger_send |
processor/batches_received |
processor/dropped_log_records |
processor/dropped_metric_points |
processor/dropped_spans |
processor/queued_retry/fail_send |
processor/queued_retry/queue_latency |
processor/queued_retry/queue_length |
processor/queued_retry/send_latency |
processor/queued_retry/success_send |
processor/refused_log_records |
processor/refused_metric_points |
processor/refused_spans |
processor/spans_dropped |
processor/spans_received |
processor/trace_batches_dropped |
receiver/accepted_log_records |
receiver/refused_log_records |
scraper/errored_metric_points |
scraper/scraped_metric_points |
--------------------------------------------------|--------------------------------
Component/Dimensions | Metric | Value
--------------------------------------------------|-----------------------------------------|--------------------------------
exporter=otlphttp | exporter/send_failed_metric_points | 57
exporter=otlphttp | exporter/send_failed_spans | 1,085
exporter=otlphttp | exporter/sent_metric_points | 0
exporter=otlphttp | exporter/sent_spans | 0
processor=batch | processor/batch/batch_send_size | 2/33.439024/80 (min/mean/max) Occurrences=41
processor=batch | processor/batch/timeout_trigger_send | 41
receiver=jaeger, transport=collector_http | receiver/accepted_spans | 1,306
receiver=jaeger, transport=collector_http | receiver/refused_spans | 0
receiver=prometheus, transport=http | receiver/accepted_metric_points | 77
receiver=prometheus, transport=http | receiver/refused_metric_points | 0
--------------------------------------------------|-----------------------------------------|--------------------------------
```
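For illustration only, here is a minimal sketch (not the actual PR code) of the periodic-logging loop described above, written in Go with zap, the logging library the Collector uses. The interval choice mirrors the description (10 seconds at debug level, 5 minutes otherwise); collectMetricsTable is a hypothetical helper standing in for whatever renders the table shown in the sample output:

```
package main

import (
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// collectMetricsTable is a hypothetical helper that would render the current
// values of the Collector's own metrics as a human-readable table.
func collectMetricsTable() string {
	// ... gather the current metric values and format them ...
	return ""
}

// logMetricsPeriodically logs the Collector's own metrics until done is closed.
// At debug level it logs on every stats export interval (10s here); otherwise
// it logs once every 5 minutes, matching the behavior described above.
func logMetricsPeriodically(logger *zap.Logger, level zapcore.Level, done <-chan struct{}) {
	interval := 5 * time.Minute
	if level == zapcore.DebugLevel {
		interval = 10 * time.Second
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			logger.Info("Internal Metrics:\n" + collectMetricsTable())
		case <-done:
			return
		}
	}
}

func main() {
	logger, _ := zap.NewDevelopment()
	done := make(chan struct{})
	go logMetricsPeriodically(logger, zapcore.InfoLevel, done)
	time.Sleep(time.Minute) // let it run briefly for demonstration
	close(done)
}
```

In a real implementation the collecting side would read the Collector's registered metric views; that part is deliberately left abstract here.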
A workaround is for the user to go to the metrics page exposed by Prometheus. Why are logs better for this?
Metric values don't exist in Prometheus unless you scrape and collect them. Scraping metrics requires a working metrics collection system, which may not exist. The Collector is an essential part of the telemetry collection pipeline, and its troubleshooting cannot rely solely on a functioning collection system. You need to be able to troubleshoot the Collector even if all of your telemetry collection is broken. Logs are self-contained and enable troubleshooting in broken situations.
If you are referring to the Prometheus metrics page exposed by the Collector itself, then I already addressed that in the PR: the Prometheus page is only available when the Collector is running, which is fine for live debugging. Logs outlive the Collector process and can be used for postmortem troubleshooting.
@bogdandrutu any thoughts on this?
I believe that having a working metrics infra is a requirement for whatever use case that involves understanding outages at a later stage. People are already expected to have this in place and we shouldn't deviate from that: adding one more source of information would be confusing, IMO. Exposing an OpenMetrics or Prometheus compatible endpoint at
Coincidentally, we can already see how annoying it is to have something like this, by using the latest OpenTelemetry Collector with the default configuration. See #2212.
I only partially agree. The output in #2212 is not very readable and thus not very useful. When done nicely, this can be useful. The output of #2103 is more human-readable and less noisy, and is only printed once every 5 minutes. It is definitely not as annoying as #2212.
Only once every 5 minutes is still quite noisy when you have a long-running service with little to no error/warning/info-level messages. To me, it still sounds like something that would be useful during the initial setup of the infra, not for production usage.
@jpkrohling I agree with you. However, I believe the initial setup is a very important use case. Users often struggle with the setup, and when things don't work and the Collector is unable to send any data, we currently don't offer much help to users. I think we need to do more to make the first-time experience nicer. This is part of that first-time experience.
I fail to see how one log record (although a large one) creates such an amount of noise that it becomes a problem. Can you explain what the problem is: why is one informational log record every 5 minutes a problem? Are you suggesting that we can log informational messages but once every 5 minutes is too often, or that we should never log any informational messages at all?
First-time experience != production experience. I agree that the Collector doesn't offer much help at the moment, especially when it swallows errors, but printing metrics to the logs isn't the way to go for the production case, IMO. In production, people already (should) have a metrics backend, such as Prometheus. Even when they don't, it's common practice to just do a cURL to
If this one log record is in a single line, then I can easily grep it out and focus on what I'm trying to find. I wouldn't like this line, but I wouldn't oppose having one. But I can't easily grep those lines out, and I'll end up fighting with the logs when I need them most. From what I understand, here's what I would see periodically:
Info-level messages are perfectly fine if they help me understand what's going on when I need to read the logs. In this case, there isn't a specific reason the metrics are printed out (no problem was detected, no event was triggered, nothing really happened), so they aren't pointing me in a specific direction. Worse, they might be distracting me and making it harder to find the log entries that will help me understand what's going on.
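As a rough, self-contained illustration of the "cURL the metrics endpoint" approach mentioned a couple of comments above, the following Go sketch fetches the Collector's own Prometheus metrics. It assumes the commonly used default self-telemetry address localhost:8888/metrics, which may differ in a given deployment:

```
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Assumes the default self-telemetry address; adjust if the Collector
	// is configured to expose its own metrics elsewhere.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		log.Fatalf("failed to fetch Collector metrics: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("failed to read response: %v", err)
	}
	fmt.Print(string(body)) // Prometheus text exposition format
}
```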
@jpkrohling This is a good approach when the Collector is running. However, the use case I am trying to tackle here is when the Collector is dead and you are doing postmortem analysis.
I have seen this done successfully with several proprietary products I worked on in the past, and such logs were very useful for troubleshooting customer problems. None of those are open source, so I can't give any public links. Just to clarify: I don't necessarily insist that this is the only right solution. I am happy to see other proposals that solve the problem. Let me restate the problem: I want to be able to troubleshoot a Collector that I am not running myself, that someone else is running, and that person is not an expert in the Collector. It is very likely that I am going to learn about the problem after it has already happened, when the Collector may have OOMed, crashed, or restarted. This is a scenario where I don't have direct access to the environment where the Collector runs. It is very likely that in such a situation the best I can get are the logs from the previous run. We have also chatted with @bogdandrutu about this. He offered to dump all useful information to the logs when the Collector crashes, OOMs, or panics. This is a possible approach, but it is not guaranteed to work since, e.g., in an OOM situation you may not be able to do much work. @jpkrohling I am open to any other suggestions you may have to help with this particular troubleshooting scenario. But please make sure you think about how you will troubleshoot problems that happen in a customer environment to which you have no access.
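The crash-time alternative mentioned in the comment above (dumping useful state when the Collector panics) could look roughly like the sketch below. snapshotMetrics is a hypothetical helper, and, as noted, a hard OOM kill gives the process no chance to run deferred code at all:

```
package main

import "go.uber.org/zap"

// snapshotMetrics is a hypothetical helper returning the Collector's own
// metrics rendered as a human-readable string.
func snapshotMetrics() string { return "" }

// dumpOnPanic logs a final metrics snapshot if the calling goroutine panics,
// then re-panics so the process still fails loudly.
func dumpOnPanic(logger *zap.Logger) {
	if r := recover(); r != nil {
		logger.Error("Collector panicked, dumping internal metrics",
			zap.Any("panic", r),
			zap.String("metrics", snapshotMetrics()))
		panic(r)
	}
}

func main() {
	logger, _ := zap.NewDevelopment()
	// Usage: defer dumpOnPanic(logger) at the top of each goroutine to protect.
	defer dumpOnPanic(logger)
	panic("simulated crash") // triggers the metrics dump before the process dies
}
```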
That's another good reason to have an observability stack, or at least a metrics backend such as Prometheus: it can survive outages of the OpenTelemetry Collector. And better than logs, it can survive even when the instance/pod simply disappears/dies, which would make logs ineffective as well ;-) As a bonus, people can easily see the evolution of the metrics in a graph, instead of trying to understand their evolution by reading log entries.
This is actually very similar to what we have in Jaeger, both in the community and as a supported offering. My experience tells me that most of the problems people have are setup, configuration, or environment issues. Reproducing those issues is typically straightforward once we have the configuration files used and enough information about the environment, such as the versions of the components involved, the storage mechanism being used, and so on. Metrics wouldn't help in those cases. A variation of this is when users are confused about a specific behavior, and turning on debug logs can explain what's going on. Then there are the cases that only appear under very specific conditions in terms of load. So far, I can remember only one issue in Jaeger that we couldn't solve with logs + configuration files: jaegertracing/jaeger#2452 (and a few other issues that seem related). In that case, metrics did help in showing that there was a problem, but didn't help in understanding the root cause. I'm skeptical about this proposal because we'd be imposing a high tax on all production instances of the OpenTelemetry Collector for all users out there, only because we think metrics in the logs might help in a few specific cases for an even smaller subset of users. IMO, it's not unreasonable to ask users to turn on debug-level logs if they are experiencing an issue consistently.
My suggestion won't help people who don't have a monitoring or observability stack in place, but I trust that users of the OpenTelemetry Collector know the value that observability tools provide. I would really start with the premise that our users do monitor their production workloads with something that is able to scrape our metrics endpoint. Given this premise, my suggestions are:
@jpkrohling good suggestions, thank you. Would you want to work on any of these yourself?
I can commit to working on the first and second points for the OTLP and Jaeger receivers and exporters, as well as for the processors I own. About the timing: I can try to schedule this for my next sprint, which starts in Calendar Week 51, but it's a very short sprint due to the holidays. In the worst case, I can get this done by Calendar Week 2 next year (Jan 15).
I opened four issues, each tracking a suggestion I've made. I'll start working on them in the given order.
The Collector's own metrics are an important source of information for troubleshooting. Typically, own metrics are scraped using a Prometheus receiver and sent to a backend where they can be examined. However, this only works if the Collector's metrics pipeline works and the backend is available. If this is not the case, which often happens when the Collector is misconfigured and cannot send metrics or the backend is not available, then these metrics cannot be seen anywhere.
In an effort to improve the supportability of the Collector in such situations, we want to output own metrics in a log file. In difficult situations the local Collector log is an important source of troubleshooting information. We want to periodically log the metric values in a human-readable form.
Note: we need to be careful to not flood the logs.