Supportability: print own metrics in logs #2098

Closed
tigrannajaryan opened this issue Nov 9, 2020 · 15 comments

@tigrannajaryan
Member

The Collector's own metrics are an important source of information for troubleshooting. Typically, own metrics are scraped by a Prometheus receiver and sent to a backend where they can be examined. However, this only works if the Collector's metric pipeline works and the backend is available. When that is not the case, which often happens when the Collector is misconfigured and cannot send metrics or the backend is unavailable, these metrics cannot be seen anywhere.

To improve the supportability of the Collector in such situations, we want to output its own metrics to the log. In difficult situations the local Collector log is the most important source of troubleshooting information, so we want to periodically log the metric values in a human-readable form.

Note: we need to be careful not to flood the logs.
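
For illustration, a minimal sketch of the proposed behavior, assuming a zap logger and a hypothetical snapshotMetrics helper; this is not the actual implementation in the PR:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"

	"go.uber.org/zap"
)

// snapshotMetrics is a hypothetical stand-in for however the Collector reads
// its own metric values (e.g. from the OpenCensus view registry).
func snapshotMetrics() map[string]int64 {
	return map[string]int64{
		"process/memory/rss":      44625920,
		"receiver/accepted_spans": 1306,
	}
}

// logMetricsPeriodically writes a human-readable table of internal metrics to
// the log. The interval is short at debug level and long otherwise, so normal
// production logs are not flooded.
func logMetricsPeriodically(logger *zap.Logger, debug bool) {
	interval := 5 * time.Minute
	if debug {
		interval = 10 * time.Second
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		metrics := snapshotMetrics()
		names := make([]string, 0, len(metrics))
		for name := range metrics {
			names = append(names, name)
		}
		sort.Strings(names)

		var b strings.Builder
		b.WriteString("\nInternal Metrics:\n")
		for _, name := range names {
			fmt.Fprintf(&b, "  %-50s | %12d\n", name, metrics[name])
		}
		logger.Info(b.String())
	}
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	go logMetricsPeriodically(logger, false)
	select {} // a real Collector would block on its own lifecycle instead
}
```

The interval selection mirrors the behavior described in the commit message below: a short interval at debug level, a longer one otherwise.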

@tigrannajaryan tigrannajaryan added the bug Something isn't working label Nov 9, 2020
@tigrannajaryan tigrannajaryan self-assigned this Nov 9, 2020
@tigrannajaryan tigrannajaryan added feature request and removed bug Something isn't working labels Nov 10, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Nov 10, 2020
Resolves open-telemetry#2098

The Collector's own metrics are an important source of information for troubleshooting.
Typically, own metrics are scraped by a Prometheus receiver and sent to a backend where
they can be examined. However, this only works if the Collector's metric pipeline works
and the backend is available. When that is not the case, which often happens when the
Collector is misconfigured and cannot send metrics or the backend is unavailable,
these metrics cannot be seen anywhere.

To improve the supportability of the Collector in such situations, we output its own
metrics to the log. In difficult situations the local Collector log is the most important
source of troubleshooting information, so we periodically log the metric values in a
human-readable form.

When the --log-level=debug command-line option is passed, the metrics are logged as often
as they are exported by the stats library (currently every 10 seconds). If --log-level is
not set to debug, the metrics are logged once every 5 minutes.

Sample output:

```
2020-11-10T09:59:34.884-0500	INFO	service/telemetry_log.go:234
  Internal Metrics:
  Metric                                            | Value
  --------------------------------------------------|--------------------------------
  exporter/send_failed_log_records                  |
  exporter/sent_log_records                         |
  fluent_closed_connections                         |
  fluent_events_parsed                              |
  fluent_opened_connections                         |
  fluent_parse_failures                             |
  fluent_records_generated                          |
  grpc.io/client/completed_rpcs                     |
  grpc.io/client/received_bytes_per_rpc             |
  grpc.io/client/received_messages_per_rpc          |
  grpc.io/client/roundtrip_latency                  |
  grpc.io/client/sent_bytes_per_rpc                 |
  grpc.io/client/sent_messages_per_rpc              |
  grpc.io/server/completed_rpcs                     |
  grpc.io/server/received_bytes_per_rpc             |
  grpc.io/server/received_messages_per_rpc          |
  grpc.io/server/sent_bytes_per_rpc                 |
  grpc.io/server/sent_messages_per_rpc              |
  grpc.io/server/server_latency                     |
  kafka_receiver_current_offset                     |
  kafka_receiver_messages                           |
  kafka_receiver_offset_lag                         |
  kafka_receiver_partition_close                    |
  kafka_receiver_partition_start                    |
  process/cpu_seconds                               |            0
  process/memory/rss                                |   44,625,920
  process/runtime/heap_alloc_bytes                  |   13,168,120 By
  process/runtime/total_alloc_bytes                 |   28,170,760 By
  process/runtime/total_sys_memory_bytes            |   76,366,848 By
  process/uptime                                    |    55.006789 s
  processor/accepted_log_records                    |
  processor/accepted_metric_points                  |
  processor/accepted_spans                          |
  processor/batch/batch_send_size_bytes             |
  processor/batch/batch_size_trigger_send           |
  processor/batches_received                        |
  processor/dropped_log_records                     |
  processor/dropped_metric_points                   |
  processor/dropped_spans                           |
  processor/queued_retry/fail_send                  |
  processor/queued_retry/queue_latency              |
  processor/queued_retry/queue_length               |
  processor/queued_retry/send_latency               |
  processor/queued_retry/success_send               |
  processor/refused_log_records                     |
  processor/refused_metric_points                   |
  processor/refused_spans                           |
  processor/spans_dropped                           |
  processor/spans_received                          |
  processor/trace_batches_dropped                   |
  receiver/accepted_log_records                     |
  receiver/refused_log_records                      |
  scraper/errored_metric_points                     |
  scraper/scraped_metric_points                     |
  --------------------------------------------------|--------------------------------

  Component/Dimensions                              | Metric                                  | Value
  --------------------------------------------------|-----------------------------------------|--------------------------------
  exporter=otlphttp                                 | exporter/send_failed_metric_points      |           57
  exporter=otlphttp                                 | exporter/send_failed_spans              |        1,085
  exporter=otlphttp                                 | exporter/sent_metric_points             |            0
  exporter=otlphttp                                 | exporter/sent_spans                     |            0
  processor=batch                                   | processor/batch/batch_send_size         | 2/33.439024/80 (min/mean/max)	Occurrences=41
  processor=batch                                   | processor/batch/timeout_trigger_send    |           41
  receiver=jaeger, transport=collector_http         | receiver/accepted_spans                 |        1,306
  receiver=jaeger, transport=collector_http         | receiver/refused_spans                  |            0
  receiver=prometheus, transport=http               | receiver/accepted_metric_points         |           77
  receiver=prometheus, transport=http               | receiver/refused_metric_points          |            0
  --------------------------------------------------|-----------------------------------------|--------------------------------
```
@bogdandrutu
Member

A workaround is for the user to go to the metrics page exposed by Prometheus. Why are logs better for this?

@tigrannajaryan
Member Author

tigrannajaryan commented Nov 12, 2020

A workaround is for the user to go to the metrics page exposed by Prometheus. Why are logs better for this?

Metric values don't exist in Prometheus unless you scrape and collect them. Scraping metrics requires a working metric collection system, which may not exist. The Collector is an essential part of the telemetry collection pipeline, and its troubleshooting cannot rely solely on a functioning collection system. You need to be able to troubleshoot the Collector even if all of your telemetry collection is broken. Logs are self-contained and enable troubleshooting in these broken situations.

@tigrannajaryan
Member Author

A workaround is for the user to go to the metrics page exposed by Prometheus. Why are logs better for this?

If you are referring to the Prometheus metrics page exposed by the Collector itself, I already addressed that in the PR: that page is only available while the Collector is running, which is fine for live debugging. Logs outlive the Collector process and can be used for postmortem troubleshooting.

@tigrannajaryan
Member Author

@bogdandrutu any thoughts on this?

@jpkrohling
Member

If you are referring to the Prometheus metrics page exposed by the Collector itself, I already addressed that in the PR: that page is only available while the Collector is running, which is fine for live debugging. Logs outlive the Collector process and can be used for postmortem troubleshooting.

I believe that having a working metrics infra is a requirement for any use case that involves understanding outages at a later stage. People are already expected to have this in place, and we shouldn't deviate from that: adding one more source of information would be confusing, IMO.

Exposing an OpenMetrics- or Prometheus-compatible endpoint at /metrics is more than sufficient. If we ever hear from a user who needs metrics in logs, we can re-evaluate this.

@jpkrohling
Member

Coincidentally, we can already see how annoying it is to have something like this, by using the latest OpenTelemetry Collector with the default configuration. See #2212.

@tigrannajaryan
Member Author

Coincidentally, we can already see how annoying it is to have something like this, by using the latest OpenTelemetry Collector with the default configuration. See #2212.

I only partially agree. The output in #2212 is not very readable and thus not very useful; when done nicely it can be useful. The output of #2103 is more human-readable and less noisy, and is only printed once every 5 minutes. Definitely not as annoying as #2212.

@jpkrohling
Member

Only once every 5 minutes is still quite noisy when you have a long-running service with little to no error/warning/info-level messages. To me, it still sounds like something that would be useful during the initial setup of the infra, not for production usage.

@tigrannajaryan
Member Author

To me, it still sounds like something that would be useful during the initial setup of the infra, not for production usage.

@jpkrohling I agree with you. However, I believe the initial setup is a very important use case. Users often struggle with the setup, and when things don't work and the Collector is unable to send any data, we currently don't offer much help. I think we need to do more to make the first-time experience nicer. This is part of that first-time experience.

Only once every 5 minutes is still quite noisy when you have a long-running service with little to no error/warning/info-level messages.

I fail to see how one log record (albeit a large one) is such an amount of noise that it creates a problem. Can you explain why one informational log record every 5 minutes is a problem? Are you suggesting that we should never log informational messages at all, or that informational messages are fine but once every 5 minutes is too often?

@jpkrohling
Member

First-time experience != production experience. I agree that the Collector doesn't offer much help at the moment, especially when it swallows errors, but printing metrics to the logs isn't the way to go for the production case, IMO. In production, people already have (or should have) a metrics backend, such as Prometheus. Even when they don't, it's common practice to simply curl the /metrics endpoint on the service being configured. In fact, that's what I do when I'm developing a component, to make sure data is flowing through the pipeline. Printing out metrics in the logs isn't what people would expect, but if you have examples of other projects doing this, it would help serve as precedent.
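
For reference, that curl-style check can also be scripted. A minimal Go sketch that fetches the Collector's self-observability endpoint, assuming the usual default address of localhost:8888 (it is configurable):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// 8888 is the usual default port for the Collector's internal metrics,
	// but it is configurable; adjust the address to your deployment.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The endpoint returns Prometheus text format, one metric per line.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```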

I fail to see how one log record (albeit a large one) is such an amount of noise that it creates a problem.

If this one log record is on a single line, then I can easily grep it out and focus on what I'm trying to find. I wouldn't like this line, but I wouldn't oppose having one.

But I can't easily grep those lines out, and I'll end up fighting with the logs when I need them most. From what I understand, here's what I would see periodically:

```
2020-11-10T09:59:34.884-0500	INFO	service/telemetry_log.go:234
  Internal Metrics:
  Metric                                            | Value
  --------------------------------------------------|--------------------------------
  exporter/send_failed_log_records                  |
  exporter/sent_log_records                         |
  fluent_closed_connections                         |
  fluent_events_parsed                              |
  fluent_opened_connections                         |
  fluent_parse_failures                             |
  fluent_records_generated                          |
  grpc.io/client/completed_rpcs                     |
  grpc.io/client/received_bytes_per_rpc             |
  grpc.io/client/received_messages_per_rpc          |
  grpc.io/client/roundtrip_latency                  |
  grpc.io/client/sent_bytes_per_rpc                 |
  grpc.io/client/sent_messages_per_rpc              |
  grpc.io/server/completed_rpcs                     |
  grpc.io/server/received_bytes_per_rpc             |
  grpc.io/server/received_messages_per_rpc          |
  grpc.io/server/sent_bytes_per_rpc                 |
  grpc.io/server/sent_messages_per_rpc              |
  grpc.io/server/server_latency                     |
  kafka_receiver_current_offset                     |
  kafka_receiver_messages                           |
  kafka_receiver_offset_lag                         |
  kafka_receiver_partition_close                    |
  kafka_receiver_partition_start                    |
  process/cpu_seconds                               |            0
  process/memory/rss                                |   44,625,920
  process/runtime/heap_alloc_bytes                  |   13,168,120 By
  process/runtime/total_alloc_bytes                 |   28,170,760 By
  process/runtime/total_sys_memory_bytes            |   76,366,848 By
  process/uptime                                    |    55.006789 s
  processor/accepted_log_records                    |
  processor/accepted_metric_points                  |
  processor/accepted_spans                          |
  processor/batch/batch_send_size_bytes             |
  processor/batch/batch_size_trigger_send           |
  processor/batches_received                        |
  processor/dropped_log_records                     |
  processor/dropped_metric_points                   |
  processor/dropped_spans                           |
  processor/queued_retry/fail_send                  |
  processor/queued_retry/queue_latency              |
  processor/queued_retry/queue_length               |
  processor/queued_retry/send_latency               |
  processor/queued_retry/success_send               |
  processor/refused_log_records                     |
  processor/refused_metric_points                   |
  processor/refused_spans                           |
  processor/spans_dropped                           |
  processor/spans_received                          |
  processor/trace_batches_dropped                   |
  receiver/accepted_log_records                     |
  receiver/refused_log_records                      |
  scraper/errored_metric_points                     |
  scraper/scraped_metric_points                     |
  --------------------------------------------------|--------------------------------

  Component/Dimensions                              | Metric                                  | Value
  --------------------------------------------------|-----------------------------------------|--------------------------------
  exporter=otlphttp                                 | exporter/send_failed_metric_points      |           57
  exporter=otlphttp                                 | exporter/send_failed_spans              |        1,085
  exporter=otlphttp                                 | exporter/sent_metric_points             |            0
  exporter=otlphttp                                 | exporter/sent_spans                     |            0
  processor=batch                                   | processor/batch/batch_send_size         | 2/33.439024/80 (min/mean/max)	Occurrences=41
  processor=batch                                   | processor/batch/timeout_trigger_send    |           41
  receiver=jaeger, transport=collector_http         | receiver/accepted_spans                 |        1,306
  receiver=jaeger, transport=collector_http         | receiver/refused_spans                  |            0
  receiver=prometheus, transport=http               | receiver/accepted_metric_points         |           77
  receiver=prometheus, transport=http               | receiver/refused_metric_points          |            0
  --------------------------------------------------|-----------------------------------------|--------------------------------
```

Info-level messages are perfectly fine if they help me understand what's going on when I need to read the logs. In this case, there isn't a specific reason the metrics are printed out (no problem was detected, no event was triggered, nothing really happened), so they aren't pointing me in a specific direction. Worse, they might be distracting me and making it harder to find the log entries that will help me understand what's going on.

@tigrannajaryan
Member Author

IMO. In production, people already have (or should have) a metrics backend, such as Prometheus. Even when they don't, it's common practice to simply curl the /metrics endpoint on the service being configured. In fact, that's what I do when I'm developing a component, to make sure data is flowing through the pipeline.

@jpkrohling This is a good approach when the Collector is running. However, the use case I am trying to tackle here is when the Collector is dead and you are doing postmortem analysis.

Printing out metrics in the logs isn't what people would expect, but if you have examples of other projects doing this, it would help serve as precedent.

I have seen this done successfully with several proprietary products I worked on in the past, and such logs were very useful for troubleshooting customer problems. None of those are open source, so I can't give any public links.

Just to clarify: I don't necessarily insist that this is the only right solution. I am happy to see other proposals that solve the problem. Let me restate the problem:

I want to be able to troubleshoot a Collector that I am not running myself, that someone else is running, and that person is not an expert in the Collector. It is very likely that I will learn about the problem after it has already happened, when the Collector may have OOM-ed, crashed, or restarted. This is a scenario where I don't have direct access to the environment where the Collector runs. It is very likely that in such a situation the best I can get are the logs from the previous run.

We have also chatted with @bogdandrutu about this. He offered to dump all useful information to the logs when the Collector crashes, OOMs, or panics. This is a possible approach, but it is not guaranteed to work since, for example, in an OOM situation you may not be able to do much work.
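
A rough sketch of that dump-on-crash idea, assuming a zap logger and a hypothetical dumpInternalState helper; as noted, it cannot cover OOM kills, where the process gets no chance to run cleanup code:

```go
package main

import (
	"go.uber.org/zap"
)

// dumpInternalState is a hypothetical helper that would log whatever metric
// values and queue lengths are cheaply available at crash time.
func dumpInternalState(logger *zap.Logger) {
	logger.Error("internal state at crash",
		zap.Int64("exporter/send_failed_spans", 1085),
		zap.Int64("processor/queued_retry/queue_length", 0),
	)
}

// withCrashDump runs fn and, if it panics, logs a final snapshot of internal
// state before re-raising the panic. An OOM kill bypasses this entirely,
// which is the limitation discussed above.
func withCrashDump(logger *zap.Logger, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			dumpInternalState(logger)
			logger.Error("collector panicked", zap.Any("panic", r))
			panic(r)
		}
	}()
	fn()
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	withCrashDump(logger, func() {
		// The real Collector run loop would go here.
	})
}
```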

@jpkrohling I am open to any other suggestions you may have to help with this particular troubleshooting scenario. But please make sure you think about how you would troubleshoot problems that happen in a customer environment to which you have no access.

@jpkrohling
Member

jpkrohling commented Dec 1, 2020

However, the use case I am trying to tackle here is when the Collector is dead and you are doing postmortem analysis.

That's another good reason to have an observability stack, or at least a metrics backend such as Prometheus: it can survive outages of the OpenTelemetry Collector. And unlike logs, it can survive even when the instance/pod simply disappears or dies, which would make the logs ineffective as well ;-) As a bonus, people can easily see the evolution of the metrics in a graph, instead of trying to understand their evolution by reading log entries.

I want to be able to troubleshoot a Collector that I am not running myself, that someone else is running, and that person is not an expert in the Collector.

This is actually very similar to what we have in Jaeger, both in the community and as a supported offering. My experience tells me that most of the problems people have are setup, configuration, or environment issues. Reproducing those issues is typically straightforward once we have the configuration files used and enough information about the environment, such as the versions of the components involved, the storage mechanism being used, and so on. Metrics wouldn't help in those cases. A variation of this is when users are confused about a specific behavior, and turning on debug logs can explain what's going on.

Then there are the cases that only appear under very specific load conditions. So far, I can remember only one issue in Jaeger that we couldn't solve with logs + configuration files: jaegertracing/jaeger#2452 (and a few other issues that seem related). In that case, metrics did help in showing that there was a problem, but didn't help in understanding the root cause.

I'm skeptical about this proposal because we'd be imposing a high tax on all production instances of the OpenTelemetry Collector for all users out there, only because we think metrics in the logs might help in a few specific cases for an even smaller subset of users. IMO, it's not unreasonable to ask users to turn on debug-level logging if they are experiencing an issue consistently.

I am open to any other suggestions you may have to help with this particular troubleshooting scenario.

My suggestion won't help people who don't have a monitoring or observability stack in place, but I trust that users of the OpenTelemetry Collector know the value that observability tools provide. I would really start with the premise that our users do monitor their production workloads with something that is able to scrape our metrics endpoint. Given this premise, my suggestions are:

@tigrannajaryan
Member Author

@jpkrohling good suggestions, thank you. Would you want to work on any of these yourself?

@jpkrohling
Member

jpkrohling commented Dec 3, 2020

I can commit to working on the first and second points for the OTLP and Jaeger receivers and exporters, as well as for the processors I own.
I can also commit to kick-starting the doc pages on the website for the third and fourth points.

About the timing: I can try to schedule this for my next sprint, which starts in Calendar Week 51, but it's a very short sprint due to the holidays. In the worst case, I can get this done by Calendar Week 2 next year (Jan 15).

@jpkrohling
Member

I opened four issues, each tracking a suggestion I've made. I'll start working on them in the given order.

@tigrannajaryan tigrannajaryan removed their assignment Feb 3, 2021
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
* Adopt discovery receiver

also fix a race condition with empty observer ID in correlation

* fix discovery receiver test flake
@github-actions github-actions bot added the Stale label Dec 21, 2023
@github-actions github-actions bot closed this as not planned Jan 20, 2024