Panic and SIGSEGV #982

Closed
beegmon opened this issue Feb 21, 2022 · 16 comments
Labels: otel-bug (OTEL upstream issues and bugs)

beegmon commented Feb 21, 2022

Describe the bug
A panic (SIGSEGV) is produced during normal operation.

Steps to reproduce
During normal operation, a SEGFAULT and panic occur, causing the OTEL agent to crash. The collector is deployed as a sidecar in an ECS EC2 task, running ECS-optimized Amazon Linux 2 on ARM64 hardware.

CONFIG (VIA ENV VAR FROM PARAMETER STORE):
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 5s
      scrape_configs:
      - job_name: "client-0"
        metrics_path: "/debug/metrics/prometheus"
        static_configs:
          - targets: [ $PROMETHEUS_LINK_NAME ]
  awsecscontainermetrics:
    collection_interval: 10s

processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes

exporters:
  awsprometheusremotewrite:
    endpoint: "
    aws_auth:
      region: "us-west-2"
      service: "aps"
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]

What did you expect to see?
I expect the process not to SEGFAULT or panic during normal operation.

What did you see instead?
LogOutput:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2456ba0]
goroutine 143 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricSummary(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:455 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricPoints(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:283 +0x304
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetric(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:269 +0x134
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).AdjustMetricSlice(0x400032ba10, 0x4001138600)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:235 +0x80
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*transactionPdata).Commit(0x400074e1c0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_transaction.go:150 +0x208
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0x400032bd08, 0x400032bd18, 0x400073b040)
github.com/prometheus/[email protected]/scrape/scrape.go:1250 +0x40
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0x400073b040, {0xc07ce8b413ee0de3, 0x15fbf849f5, 0x54ed800}, {0x13f51c5f, 0xed9a5225a, 0x54ed800}, 0x0)
github.com/prometheus/[email protected]/scrape/scrape.go:1321 +0xe0c
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0x400073b040, 0x0)
github.com/prometheus/[email protected]/scrape/scrape.go:1203 +0x2d0
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
github.com/prometheus/[email protected]/scrape/scrape.go:584 +0x8f8

Environment
The collector is running in AWS as a sidecar within an ECS task, on ECS-optimized Amazon Linux 2 on an ARM64 host.

Additional context
This doesn't happen immediately, only after 10 min or so of run time.


beegmon commented Feb 23, 2022

It looks like this issue may have been fixed in contrib. I am curious when those changes will be pulled into the AWS OTel agent?


bryan-aguilar commented Feb 23, 2022

If these changes were recently fixed upstream, you can expect them to be pulled into the ADOT Collector release v0.18.0.

@Aneurysm9 Aneurysm9 added this to the v0.18.0 milestone Feb 23, 2022
@Aneurysm9 Aneurysm9 added the otel-bug OTEL upstream issues and bugs label Feb 23, 2022
@vsakaram

Closing as addressed earlier in the year.

@davetbo-amzn

I'm getting this issue now. Is this really fixed? Here's my otel-config with \n replaced by newlines for readability. Note that the \" escapes below are there because this was originally a quoted string in the template; I left them in for minimal changes.

receivers:  
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
      - job_name: \"appmesh-envoy\"
        sample_limit: 10000
        metrics_path: /stats/prometheus
        static_configs:
          - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
    region: us-east-1
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    auth:
      authenticator: sigv4auth
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]

Here's the way it is in my template:

      Value: !Sub "receivers:  \n  prometheus:\n    config:\n      global:\n        scrape_interval: 1m\n        scrape_timeout: 10s\n      scrape_configs:\n      - job_name: \"appmesh-envoy\"\n        sample_limit: 10000\n        metrics_path: /stats/prometheus\n        static_configs:\n          - targets: ['0.0.0.0:9901']\n  awsecscontainermetrics:\n    collection_interval: 15s\n  otlp:\n    protocols:\n      grpc:\n        endpoint: 0.0.0.0:4317\n      http:\n        endpoint: 0.0.0.0:55681\n  awsxray:\n    endpoint: 0.0.0.0:2000\n    transport: udp\n  statsd:\n    endpoint: 0.0.0.0:8125\n    aggregation_interval: 60s\nprocessors:\n  batch/traces:\n    timeout: 1s\n    send_batch_size: 50\n  batch/metrics:\n    timeout: 60s\n  filter:\n    metrics:\n      include:\n        match_type: strict\n        metric_names:\n          - ecs.task.memory.utilized\n          - ecs.task.memory.reserved\n          - ecs.task.memory.usage\n          - ecs.task.cpu.utilized\n          - ecs.task.cpu.reserved\n          - ecs.task.cpu.usage.vcpu\n          - ecs.task.network.rate.rx\n          - ecs.task.network.rate.tx\n          - ecs.task.storage.read_bytes\n          - ecs.task.storage.write_bytes\nexporters:\n  awsxray:\n    region: us-east-1\n  prometheusremotewrite:\n    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n    auth:\n      authenticator: sigv4auth\n  awsemf:\n    namespace: ECS/AWSOtel/Application\n    log_group_name: '/ecs/application/metrics/{ClusterName}'\n    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n    resource_to_telemetry_conversion:\n      enabled: true\n    dimension_rollup_option: NoDimensionRollup\n    metric_declarations:\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_(total|xx)$\"\n          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n          - \"^envoy_cluster_membership_(healthy|total)$\"\n          - \"^envoy_server_memory_(allocated|heap_size)$\"\n          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n          - \"^envoy_cluster_upstream_rq_retry$\"\n          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n          - \"^envoy_server_(version|uptime|live)$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_xx$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n  logging:\n    loglevel: debug\nextensions:\n  health_check:\n  pprof:\n    endpoint: :1888\n  zpages:\n    endpoint: :55679\nservice:\n  extensions: [pprof, zpages, health_check]\n  pipelines:\n    metrics:\n      receivers: [otlp, statsd]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    metrics/envoy:\n      receivers: [prometheus]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    
metrics/ecs:\n      receivers: [awsecscontainermetrics]\n      processors: [filter, batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    traces:\n      receivers: [otlp, awsxray]\n      processors: [batch/traces]\n      exporters: [awsxray]\n"

Here's the error with the code and memory address:

panic: runtime error: invalid memory address or nil pointer dereference 
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]

Any advice would be greatly appreciated.

@vsakaram

Thanks @davetbo-amzn for reaching out with the above details, much appreciated. Reopening so we can review and update.

@vsakaram vsakaram reopened this Feb 16, 2023
@Aneurysm9

@davetbo-amzn can you please include the full stack trace that followed the panic?


davetbo-amzn commented Feb 16, 2023

Here you go. This was the entirety of the fargate/otel/otel-collector* log for this run from CloudWatch. Thanks for taking a look!


2023/02/16 14:46:24 ADOT Collector version: v0.26.1
--
2023/02/16 14:46:24 found no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
2023/02/16 14:46:24 Reading AOT config from environment: receivers:
prometheus:
config:
global:
scrape_interval: 1m
scrape_timeout: 10s
scrape_configs:
- job_name: "appmesh-envoy"
sample_limit: 10000
metrics_path: /stats/prometheus
static_configs:
- targets: ['0.0.0.0:9901']
awsecscontainermetrics:
collection_interval: 15s
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:55681
awsxray:
endpoint: 0.0.0.0:2000
transport: udp
statsd:
endpoint: 0.0.0.0:8125
aggregation_interval: 60s
processors:
batch/traces:
timeout: 1s
send_batch_size: 50
batch/metrics:
timeout: 60s
filter:
metrics:
include:
match_type: strict
metric_names:
- ecs.task.memory.utilized
- ecs.task.memory.reserved
- ecs.task.memory.usage
- ecs.task.cpu.utilized
- ecs.task.cpu.reserved
- ecs.task.cpu.usage.vcpu
- ecs.task.network.rate.rx
- ecs.task.network.rate.tx
- ecs.task.storage.read_bytes
- ecs.task.storage.write_bytes
exporters:
awsxray:
region: us-east-1
prometheusremotewrite:
endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-f87b6940-bfcc-4ec4-b9b1-325189711ad5/api/v1/remote_write
auth:
authenticator: sigv4auth
awsemf:
namespace: ECS/AWSOtel/Application
log_group_name: '/ecs/application/metrics/{ClusterName}'
log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
resource_to_telemetry_conversion:
enabled: true
dimension_rollup_option: NoDimensionRollup
metric_declarations:
- dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
metric_name_selectors:
- "^envoy_http_downstream_rq_(total\|xx)$"
- "^envoy_cluster_upstream_cx_(r\|t)x_bytes_total$"
- "^envoy_cluster_membership_(healthy\|total)$"
- "^envoy_server_memory_(allocated\|heap_size)$"
- "^envoy_cluster_upstream_cx_(connect_timeout\|destroy_local_with_active_rq)$"
- "^envoy_cluster_upstream_rq_(pending_failure_eject\|pending_overflow\|timeout\|per_try_timeout\|rx_reset\|maintenance_mode)$"
- "^envoy_http_downstream_cx_destroy_remote_active_rq$"
- "^envoy_cluster_upstream_flow_control_(paused_reading_total\|resumed_reading_total\|backed_up_total\|drained_total)$"
- "^envoy_cluster_upstream_rq_retry$"
- "^envoy_cluster_upstream_rq_retry_(success\|overflow)$"
- "^envoy_server_(version\|uptime\|live)$"
label_matchers:
- label_names:
- container_name
regex: ^envoy$
- dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
metric_name_selectors:
- "^envoy_http_downstream_rq_xx$"
label_matchers:
- label_names:
- container_name
regex: ^envoy$
logging:
loglevel: debug
extensions:
health_check:
pprof:
endpoint: :1888
zpages:
endpoint: :55679
service:
extensions: [pprof, zpages, health_check]
pipelines:
metrics:
receivers: [otlp, statsd]
processors: [batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
metrics/envoy:
receivers: [prometheus]
processors: [batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
metrics/ecs:
receivers: [awsecscontainermetrics]
processors: [filter, batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
traces:
receivers: [otlp, awsxray]
processors: [batch/traces]
exporters: [awsxray]
2023-02-16T14:46:24.621Z	info	service/telemetry.go:90	Setting up own telemetry...
2023-02-16T14:46:24.621Z	info	service/telemetry.go:116	Serving Prometheus metrics	{     "address": ":8888",     "level": "Basic" }
2023-02-16T14:46:24.630Z	info	exporter/exporter.go:290	Development component. May change in the future.	{     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.630Z	warn	[email protected]/factory.go:109	'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior.	{     "kind": "exporter",     "data_type": "metrics",     "name": "logging",     "loglevel": "debug",     "equivalent verbosity level": "Detailed" }
2023-02-16T14:46:24.637Z	info	[email protected]/metrics.go:97	Metric filter configured	{     "kind": "processor",     "name": "filter",     "pipeline": "metrics/ecs",     "include match_type": "strict",     "include expressions": [],     "include metric names": [         "ecs.task.memory.utilized",         "ecs.task.memory.reserved",         "ecs.task.memory.usage",         "ecs.task.cpu.utilized",         "ecs.task.cpu.reserved",         "ecs.task.cpu.usage.vcpu",         "ecs.task.network.rate.rx",         "ecs.task.network.rate.tx",         "ecs.task.storage.read_bytes",         "ecs.task.storage.write_bytes"     ],     "include metrics with resource attributes": null,     "exclude match_type": "",     "exclude expressions": [],     "exclude metric names": [],     "exclude metrics with resource attributes": null }
2023-02-16T14:46:24.638Z	info	[email protected]/receiver.go:58	Going to listen on endpoint for X-Ray segments	{     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z	info	udppoller/poller.go:106	Listening on endpoint for X-Ray segments	{     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z	info	[email protected]/receiver.go:69	Listening on endpoint for X-Ray segments	{     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.640Z	info	service/service.go:128	Starting aws-otel-collector...	{     "Version": "v0.26.1",     "NumCPU": 2 }
2023-02-16T14:46:24.640Z	info	extensions/extensions.go:41	Starting extensions...
2023-02-16T14:46:24.640Z	info	extensions/extensions.go:44	Extension is starting...	{     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z	info	[email protected]/zpagesextension.go:64	Registered zPages span processor on tracer provider	{     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z	info	[email protected]/zpagesextension.go:74	Registered Host's zPages	{     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z	info	[email protected]/zpagesextension.go:86	Starting zPages extension	{     "kind": "extension",     "name": "zpages",     "config": {         "TCPAddr": {             "Endpoint": ":55679"         }     } }
2023-02-16T14:46:24.640Z	info	extensions/extensions.go:48	Extension started.	{     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z	info	extensions/extensions.go:44	Extension is starting...	{     "kind": "extension",     "name": "health_check" }
2023-02-16T14:46:24.640Z	info	[email protected]/healthcheckextension.go:45	Starting health_check extension	{     "kind": "extension",     "name": "health_check",     "config": {         "Endpoint": "0.0.0.0:13133",         "TLSSetting": null,         "CORS": null,         "Auth": null,         "MaxRequestBodySize": 0,         "IncludeMetadata": false,         "Path": "/",         "CheckCollectorPipeline": {             "Enabled": false,             "Interval": "5m",             "ExporterFailureThreshold": 5         }     } }
2023-02-16T14:46:24.641Z	warn	internal/warning.go:51	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks	{     "kind": "extension",     "name": "health_check",     "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks" }
2023-02-16T14:46:24.641Z	info	extensions/extensions.go:48	Extension started.	{     "kind": "extension",     "name": "health_check" }
2023-02-16T14:46:24.641Z	info	extensions/extensions.go:44	Extension is starting...	{     "kind": "extension",     "name": "pprof" }
2023-02-16T14:46:24.641Z	info	[email protected]/pprofextension.go:71	Starting net/http/pprof server	{     "kind": "extension",     "name": "pprof",     "config": {         "TCPAddr": {             "Endpoint": ":1888"         },         "BlockProfileFraction": 0,         "MutexProfileFraction": 0,         "SaveToFile": ""     } }
2023-02-16T14:46:24.641Z	info	extensions/extensions.go:48	Extension started.	{     "kind": "extension",     "name": "pprof" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:86	Starting exporters...
2023-02-16T14:46:24.641Z	info	service/pipelines.go:90	Exporter is starting...	{     "kind": "exporter",     "data_type": "traces",     "name": "awsxray" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:94	Exporter started.	{     "kind": "exporter",     "data_type": "traces",     "name": "awsxray" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:90	Exporter is starting...	{     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:94	Exporter started.	{     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:90	Exporter is starting...	{     "kind": "exporter",     "data_type": "metrics",     "name": "prometheusremotewrite" }
2023-02-16T14:46:24.641Z	info	service/service.go:154	Starting shutdown...
2023-02-16T14:46:24.641Z	info	healthcheck/handler.go:129	Health Check state change	{     "kind": "extension",     "name": "health_check",     "status": "unavailable" }
2023-02-16T14:46:24.641Z	info	service/pipelines.go:130	Stopping receivers...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]
goroutine 1 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsecscontainermetricsreceiver.(*awsEcsContainerMetricsReceiver).Shutdown(0x0?, {0x3de05ee?, 0x75e419?})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/receiver.go:82 +0x16
go.opentelemetry.io/collector/service.(*builtPipelines).ShutdownAll(0xc00082cc80, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/[email protected]/service/pipelines.go:133 +0x499
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc00052a000, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/[email protected]/service/service.go:160 +0xd9
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/[email protected]/otelcol/collector.go:181 +0x5a8
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/[email protected]/otelcol/collector.go:205 +0x65
main.newCommand.func1(0xc00020e900, {0x3db707e?, 0x1?, 0x1?})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:122 +0x267
github.com/spf13/cobra.(*Command).execute(0xc00020e900, {0xc000122010, 0x1, 0x1})
github.com/spf13/[email protected]/command.go:916 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0xc00020e900)
github.com/spf13/[email protected]/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/[email protected]/command.go:968
main.runInteractive({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:84 +0x5e
main.run({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main_others.go:42 +0xf8
main.main()
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:77 +0x2be


bryan-aguilar commented Feb 17, 2023

@davetbo-amzn thanks for the report! I have filed a PR upstream to fix this. I'll leave this open until I can confidently say what version of the ADOT Collector the fix will be a part of.

@bryan-aguilar bryan-aguilar self-assigned this Feb 17, 2023
@davetbo-amzn

Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally in this thread? Might it be that my stack somehow pulled an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how they set it up.

If it's possible I have an old version, how would I check my version?


davetbo-amzn commented Feb 17, 2023

This config works:

      Value: !Sub "receivers:  \n  prometheus:\n    config:\n      global:\n        scrape_interval: 1m\n        scrape_timeout: 10s\n      scrape_configs:\n      - job_name: \"appmesh-envoy\"\n        sample_limit: 10000\n        metrics_path: /stats/prometheus\n        static_configs:\n          - targets: ['0.0.0.0:9901']\n  awsecscontainermetrics:\n    collection_interval: 15s\n  otlp:\n    protocols:\n      grpc:\n        endpoint: 0.0.0.0:4317\n      http:\n        endpoint: 0.0.0.0:55681\n  awsxray:\n    endpoint: 0.0.0.0:2000\n    transport: udp\n  statsd:\n    endpoint: 0.0.0.0:8125\n    aggregation_interval: 60s\nprocessors:\n  batch/traces:\n    timeout: 1s\n    send_batch_size: 50\n  batch/metrics:\n    timeout: 60s\n  filter:\n    metrics:\n      include:\n        match_type: strict\n        metric_names:\n          - ecs.task.memory.utilized\n          - ecs.task.memory.reserved\n          - ecs.task.memory.usage\n          - ecs.task.cpu.utilized\n          - ecs.task.cpu.reserved\n          - ecs.task.cpu.usage.vcpu\n          - ecs.task.network.rate.rx\n          - ecs.task.network.rate.tx\n          - ecs.task.storage.read_bytes\n          - ecs.task.storage.write_bytes\nexporters:\n  awsxray:\n  prometheusremotewrite:\n    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n    resource_to_telemetry_conversion:\n      enabled: true\n  awsemf:\n    namespace: ECS/AWSOtel/Application\n    log_group_name: '/ecs/application/metrics/{ClusterName}'\n    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n    resource_to_telemetry_conversion:\n      enabled: true\n    dimension_rollup_option: NoDimensionRollup\n    metric_declarations:\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_(total|xx)$\"\n          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n          - \"^envoy_cluster_membership_(healthy|total)$\"\n          - \"^envoy_server_memory_(allocated|heap_size)$\"\n          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n          - \"^envoy_cluster_upstream_rq_retry$\"\n          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n          - \"^envoy_server_(version|uptime|live)$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_xx$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n  logging:\n    loglevel: debug\nextensions:\n  health_check:\n  pprof:\n    endpoint: :1888\n  zpages:\n    endpoint: :55679\nservice:\n  extensions: [pprof, zpages, health_check]\n  pipelines:\n    metrics:\n      receivers: [otlp, statsd]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    metrics/envoy:\n      receivers: [prometheus]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    
metrics/ecs:\n      receivers: [awsecscontainermetrics]\n      processors: [filter, batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    traces:\n      receivers: [otlp, awsxray]\n      processors: [batch/traces]\n      exporters: [awsxray]\n"

Or presented with the \n turned into newlines:

receivers:  
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
      - job_name: \"appmesh-envoy\"
        sample_limit: 10000
        metrics_path: /stats/prometheus
        static_configs:
          - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    resource_to_telemetry_conversion:
      enabled: true
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]

Here's the diff:

diff old.yml new.yml
50d49
<     region: us-east-1
53,54c52,53
<     auth:
<       authenticator: sigv4auth
---
>     resource_to_telemetry_conversion:
>       enabled: true
113a113
> "


bryan-aguilar commented Feb 17, 2023

The SIGSEGV you reported was due to an unchecked nil value in the shutdown path of the awsecscontainermetrics receiver. The original report was an error in the prometheus receiver. They do not appear to be related, other than both being segmentation faults.
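
For illustration only — this is not the actual contrib code, and the type and field names are hypothetical — a minimal Go sketch of the kind of nil guard that prevents this class of panic when Shutdown runs on a receiver that was never fully started:

package main

import (
	"context"
	"fmt"
)

// hypothetical receiver; the fields are illustrative, not the real
// awsecscontainermetricsreceiver implementation.
type ecsMetricsReceiver struct {
	cancel context.CancelFunc // stays nil if Start was never called
}

// Shutdown guards against the receiver never having been started:
// calling r.cancel unconditionally would dereference a nil function
// value and panic with SIGSEGV, which is the failure mode seen here.
func (r *ecsMetricsReceiver) Shutdown(ctx context.Context) error {
	if r.cancel != nil {
		r.cancel()
	}
	return nil
}

func main() {
	r := &ecsMetricsReceiver{} // never started, so cancel is nil
	if err := r.Shutdown(context.Background()); err != nil {
		fmt.Println("shutdown error:", err)
	}
	fmt.Println("shutdown completed without panic")
}

The upstream fix referenced later in this thread addresses the same general pattern: check before calling.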

@Aneurysm9

Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally in this thread? Might it be that my stack somehow pulled an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how they set it up.

Yes, this was a different issue. Or, rather, a different instance of the same class of issue. The original report related to metric adjustment in the Prometheus receiver failing to check whether a pointer was nil prior to using it. Your issue related to shutdown of the awsecscontainermetrics receiver failing to check whether a function pointer was nil prior to using it.

If it's possible I have an old version, how would I check my version?

You can see your version in the logs:

2023-02-16T14:46:24.640Z	info	service/service.go:128	Starting aws-otel-collector...	{     "Version": "v0.26.1",     "NumCPU": 2 }
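
Both crashes follow the pattern described above: dereferencing a pointer that can legitimately be nil. As a purely illustrative Go sketch (the cache, types, and function here are hypothetical, not the actual prometheusreceiver adjuster code), the adjuster-style case looks roughly like this, where the lookup of a previously seen point can return nil and must be checked before use:

package main

import "fmt"

// illustrative stand-in for a cached previous summary data point
type summaryPoint struct {
	count uint64
	sum   float64
}

type pointCache map[string]*summaryPoint

// adjust looks up the previous point for a series. The lookup returns
// nil when the series has not been seen before, so it must be checked
// before dereferencing; skipping that check is the nil-dereference
// panic reported in the original stack trace.
func adjust(cache pointCache, series string, current summaryPoint) summaryPoint {
	prev := cache[series]
	if prev == nil {
		// first observation: nothing to adjust against, just cache it
		cache[series] = &summaryPoint{count: current.count, sum: current.sum}
		return current
	}
	return summaryPoint{
		count: current.count - prev.count,
		sum:   current.sum - prev.sum,
	}
}

func main() {
	cache := pointCache{}
	fmt.Println(adjust(cache, "http_request_duration", summaryPoint{count: 10, sum: 1.5}))
	fmt.Println(adjust(cache, "http_request_duration", summaryPoint{count: 25, sum: 3.0}))
}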

@davetbo-amzn

Thanks for the quick responses, all!

codeboten pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Feb 21, 2023
…tdown (#18736)

Fix possible sigsev error that could occur during shutdown if component was not correctly initialized.

Link to tracking Issue: Reported in ADOT downstream repository aws-observability/aws-otel-collector#982 (comment)
@vsakaram

Update: The fix is merged as part of upstream collector release v0.72 and will be available in the next ADOT Collector release in about a week.


vsakaram commented Mar 7, 2023

@davetbo-amzn we released the ADOT Collector v0.27.0 (https://aws-otel.github.io/docs/ReleaseBlogs/aws-distro-for-opentelemetry-collector-v0.27.0) earlier this week, addressing this issue.

@davetbo-amzn

That seems to have resolved the error. Thanks!
