Panic and SIGSEGV #982
Comments
It looks like this issue may have been addressed in contrib. When will these changes be pulled into the AWS OTel agent?
If these changes were recently fixed upstream, you can expect them to be pulled into an upcoming ADOT Collector release.
Closing as addressed earlier in the year.
I'm getting this issue now. Is this really fixed? Here's my otel-config with \n replaced by newlines for readability. Note that the \" below appear because this was originally a quoted string in the template; I left them in to keep the changes minimal.
Here's the way it is in my template:
Here's the error with the code and memory address:
Any advice would be greatly appreciated.
Thanks @davetbo-amzn for reaching out with the above details, appreciated. Reopening so we can review and update.
@davetbo-amzn can you please include the full stack trace that followed the panic?
Here you go. This was the entirety of the fargate/otel/otel-collector* log for this run from CloudWatch. Thanks for taking a look!
@davetbo-amzn thanks for the report! I have filed a PR upstream to fix this. I'll leave this open until I can confidently say what version of the ADOT Collector the fix will be a part of.
Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally reported in this thread? Might it be that my stack somehow pulled an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how they set it up. If it's possible I have an old version, how would I check my version?
This config works:
Or presented with the \n turned into newlines:
Here's the diff:
The SIGSEGV you reported was due to an unchecked nil value in the shutdown process of
Yes, this was a different issue. Or, rather, a different instance of the same class of issue. The original report related to metric adjustment in the Prometheus receiver that failed to check whether a pointer was nil.
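Both crashes come down to dereferencing a pointer that can legitimately be nil. As a rough sketch only, in Go (the type and function names below are hypothetical and are not the collector's actual code), the class of fix is to guard the lookup before using it:

package main

import "fmt"

// summaryPoint is a hypothetical stand-in for the previously observed
// datapoint that a metrics adjuster would look up for a timeseries.
type summaryPoint struct {
    count uint64
    sum   float64
}

// cache maps a series key to its last observed point; a missing entry
// yields a nil pointer, which is exactly the case that must be checked.
type cache map[string]*summaryPoint

// adjustSummary shows the guarded pattern: when there is no previous
// point, store the current one and return instead of dereferencing nil.
func adjustSummary(c cache, key string, current *summaryPoint) {
    prev := c[key]
    if prev == nil { // the missing check is what produces the SIGSEGV
        c[key] = current
        return
    }
    fmt.Printf("delta count=%d sum=%f\n", current.count-prev.count, current.sum-prev.sum)
    c[key] = current
}

func main() {
    c := cache{}
    adjustSummary(c, "client-0", &summaryPoint{count: 10, sum: 1.5}) // first scrape: no previous point
    adjustSummary(c, "client-0", &summaryPoint{count: 12, sum: 2.0}) // later scrape: safe to diff
}

The upstream fixes apply this kind of guard in the Prometheus receiver's metrics adjuster and in the affected component's shutdown path, respectively.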
You can see your version in the logs:
Thanks for the quick responses, all!
…tdown (#18736) Fix possible SIGSEGV error that could occur during shutdown if a component was not correctly initialized. Link to tracking issue: reported in ADOT downstream repository aws-observability/aws-otel-collector#982 (comment)
@davetbo-amzn we released ADOT Collector v0.27.0 (https://aws-otel.github.io/docs/ReleaseBlogs/aws-distro-for-opentelemetry-collector-v0.27.0) earlier this week, addressing this issue.
That seems to have resolved the error. Thanks!
Describe the bug
A panic (SIGSEGV) is produced during normal operation
Steps to reproduce
During normal operation, a SEGFAULT and panic is produced, causing the OTel agent to crash. The collector is deployed as a sidecar in an ECS EC2 task, running ECS-optimized Amazon Linux 2 on ARM64 hardware.
CONFIG (VIA ENV VAR FROM PARAMETER STORE):
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 5s
      scrape_configs:
        - job_name: "client-0"
          metrics_path: "/debug/metrics/prometheus"
          static_configs:
            - targets: [ $PROMETHEUS_LINK_NAME ]
  awsecscontainermetrics:
    collection_interval: 10s
processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsprometheusremotewrite:
    endpoint: "
    aws_auth:
      region: "us-west-2"
      service: "aps"
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]
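As a side note on the "via env var from Parameter Store" part above: a common pattern (an assumption here, not something taken from this report) is to store the YAML in SSM Parameter Store and inject it into the sidecar through the AOT_CONFIG_CONTENT environment variable using the task definition's secrets block. A minimal, hypothetical fragment (account ID, parameter name, and image tag are placeholders) might look like:

{
  "containerDefinitions": [
    {
      "name": "aws-otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.27.0",
      "secrets": [
        {
          "name": "AOT_CONFIG_CONTENT",
          "valueFrom": "arn:aws:ssm:us-west-2:123456789012:parameter/otel-collector-config"
        }
      ]
    }
  ]
}

Pinning the image to an explicit release tag also makes it easier to answer the "which collector version am I running?" question discussed above in this thread.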
What did you expect to see?
I expect the process not to SEGFAULT or panic during normal operation
What did you see instead?
Log output:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2456ba0]
goroutine 143 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricSummary(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:455 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricPoints(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:283 +0x304
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetric(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:269 +0x134
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).AdjustMetricSlice(0x400032ba10, 0x4001138600)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_metrics_adjuster.go:235 +0x80
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*transactionPdata).Commit(0x400074e1c0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/otlp_transaction.go:150 +0x208
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0x400032bd08, 0x400032bd18, 0x400073b040)
github.com/prometheus/[email protected]/scrape/scrape.go:1250 +0x40
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0x400073b040, {0xc07ce8b413ee0de3, 0x15fbf849f5, 0x54ed800}, {0x13f51c5f, 0xed9a5225a, 0x54ed800}, 0x0)
github.com/prometheus/[email protected]/scrape/scrape.go:1321 +0xe0c
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0x400073b040, 0x0)
github.com/prometheus/[email protected]/scrape/scrape.go:1203 +0x2d0
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
github.com/prometheus/[email protected]/scrape/scrape.go:584 +0x8f8
Environment
The collector is running in AWS as a sidecar within an ECS task, on ECS-optimized Amazon Linux 2 on an ARM64 host
Additional context
This doesn't happen immediately, only after 10 min or so of run time.