Race in memqueue leads to panic: sync: negative WaitGroup counter #37702
We have the same issue after upgrading Filebeat.
We are seeing this in v8.12.0 and also in v8.11.4. Probably related to changes that went into 8.11.3? https://www.elastic.co/guide/en/beats/libbeat/8.12/release-notes-8.11.3.html
That changelog looks like a mistake in the docs backporting... just confirmed in the source, those flag changes didn't take effect until 8.12. The queue shows up in the stack trace because the panic happens in an acknowledgment callback forwarded by the queue, but the actual panic is in Filebeat, which uses a WaitGroup to track all outstanding events in the pipeline. So it's not clear yet that the queue itself is part of the problem -- a bookkeeping error at any stage of the pipeline could cause this sort of failure. I'm investigating now.
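For context on the panic itself, here is a minimal standalone Go sketch (not Beats code) showing how a single extra "done" signal on a sync.WaitGroup produces exactly this message; any stage that signals completion twice for one event is enough, which is why the stack trace alone doesn't pin down the culprit:

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // one outstanding event enters the pipeline
	wg.Done() // the event is acknowledged: counter returns to zero
	wg.Done() // a second acknowledgment for the same event:
	//           panic: sync: negative WaitGroup counter
}
```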
@faec we had no issues with 8.9.0.
I am doing some testing and this issue appears in v8.12.0, v8.11.4, v8.11.3, and v8.11.2.
@kbujold Thanks! (And thanks for the config!) Have you tried v8.11.1 or earlier? I haven't found any promising causes in 8.11.2 and I'm wondering how much I should broaden my search.
@faec we are not seeing the issue in v8.11.1 so far. The intent was to update our product to the latest ELK release, so please keep us updated as we cannot release with this bug.
@faec, this is our Filebeat config:

# cat /etc/filebeat/filebeat.yml
filebeat.registry.file: /var/lib/filebeat/registry
# Monitoring http endpoint
http:
enabled: true
host: localhost
port: 5066
filebeat.config.inputs:
enabled: true
path: inputs.d/*
setup:
ilm.enabled: false
template:
enabled: false
overwrite: false
logging.metrics.enabled: false
processors:
- add_fields:
target: ''
fields:
env: prod
- add_cloud_metadata: ~
- drop_fields:
fields:
- agent.ephemeral_id
- agent.hostname
- agent.id
- agent.name
- agent.type
- cloud.account
- cloud.image
- cloud.machine
- cloud.provider
- cloud.region
- cloud.service
- ecs
- input
- input.type
- log.file.path
- log.offset
- stream
output.elasticsearch:
hosts: ["es.my-domain.com:9200"]
compression_level: 1
indices:
- index: "syslog-%{[agent.version]}"
when.and:
- has_fields: ['type']
- equals.type: 'syslog'
- index: "kernel-%{[agent.version]}"
when.and:
- has_fields: [ 'type' ]
- equals.type: 'kernel'
- index: "cloud-init-%{[agent.version]}"
when.and:
- has_fields: ['type']
- equals.type: 'cloud-init'
- index: "my-app-%{[agent.version]}"
when.not.and:
- has_fields: ['type']
- or:
- equals.type: 'syslog'
- equals.type: 'kernel'
- equals.type: 'cloud-init'
# Increased "queue" and "bulk_max_size" reduced the number of "panic: sync: negative WaitGroup counter"
# on hosts with a high rate of logs, e.g. 1200 logs/sec and more
output.elasticsearch.bulk_max_size: 2400
output.elasticsearch.worker: 1
queue.mem.events: 4800
queue.mem.flush.min_events: 2400

Input configs:

# cat /etc/filebeat/inputs.d/app-log.yml
- type: filestream
id: app-log
paths:
- /home/ubuntu/my-app/logs/*-json.log
parsers:
- ndjson:
ignore_decoding_error: true
processors:
- timestamp:
field: timestamp
layouts:
- '2006-01-02T15:04:05Z'
- '2006-01-02T15:04:05.999Z'
- '2006-01-02T15:04:05.999-07:00'
test:
- '2023-04-16T17:45:35.999Z'
ignore_missing: true
- drop_fields:
fields: [timestamp]
ignore_missing: true
---
# cat /etc/filebeat/inputs.d/docker.yml
- type: container
paths:
- '/var/lib/docker/containers/*/*.log'
json:
keys_under_root: true
ignore_decoding_error: true
---
# cat /etc/filebeat/inputs.d/syslog.yml
- type: filestream
id: syslog
paths:
- '/var/log/syslog*'
processors:
- add_fields:
target: ''
fields:
type: 'syslog'
---
# cat /etc/filebeat/inputs.d/kernel.yml
- type: filestream
id: kernel
paths:
- '/var/log/kern.log'
processors:
- add_fields:
target: ''
fields:
type: 'kernel'
---
# cat /etc/filebeat/inputs.d/cloud-init.yml
- type: filestream
id: cloud-init
paths:
- '/var/log/cloud-init-output.log'
ignore_older: '2h'
close_inactive: '10m'
exclude_lines: ['^\|', '^\+']
parsers:
- multiline:
type: pattern
pattern: '^ci-info.*'
negate: false
match: after
processors:
- add_fields:
target: ''
fields:
type: 'cloud-init'

We didn't have this issue with Filebeat.
We have this issue with 8.11.3 fleet-managed elastic-agent running the Kubernetes integration.
I believe I've found the cause. It's a subtle side effect of #37077, which is supposed to keep producers from getting stuck when the queue shuts down. However, as written, the cancellation is triggered on shutdown of the queue producer rather than the queue itself. This means that if a producer is closed while it is waiting for a response from the queue, it can return failure even though the queue insert was successful (just incomplete). This makes it possible for that event to receive two "done" signals: one from the cancelled insert, and one when it is correctly acknowledged upstream. I'm preparing a fix.
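To make that concrete, here is a simplified sketch of the problematic wait. The type and field names are hypothetical stand-ins, not the actual memqueue code:

```go
package main

// Hypothetical stand-ins for the real memqueue types.
type pushRequest struct {
	resp chan uint64 // the queue's run loop answers here once the event is stored
}

type producer struct {
	done     chan struct{}    // closed when *this producer* is closed
	requests chan pushRequest // consumed by the queue's run loop
}

// publish hands an insert request to the queue and waits for the result.
// Once the request has been sent, the queue will eventually store and
// acknowledge the event; cancelling the wait because the *producer* was
// closed makes publish report failure for an event that was actually
// accepted. The caller then releases the event once for the "failure" and
// the acknowledgment path releases it again, driving the pipeline's
// WaitGroup negative.
func (p *producer) publish(req pushRequest) (uint64, bool) {
	p.requests <- req
	select {
	case id := <-req.resp:
		return id, true
	case <-p.done: // BUG: producer shutdown, not queue shutdown
		return 0, false
	}
}

func main() {} // compile-only sketch; not wired to a real queue
```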
Fixed by #38094 (Memory queue: cancel in-progress writes on queue closed, not producer closed). From the PR description: Fixes a race condition that could lead to incorrect event totals and occasional panics (#37702). Once a producer sends an insert request to the memory queue, it must wait on the response unless the queue itself is closed; otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. The PR adds the queue's done signal to the producer struct and waits on that once the insert request is sent.

Backports: #38177, #38178, and #38279 (cherry-picked from commit d23b4d3).
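Under the same hypothetical names as the earlier sketch, the fix described above roughly corresponds to carrying the queue's done channel on the producer and waiting on it once the request is in flight. This is a sketch of the described semantics, not the actual PR code:

```go
// Builds on the earlier sketch; the queue's done channel is added to the producer.
type producer struct {
	done      chan struct{}    // closed when this producer is closed
	queueDone chan struct{}    // closed when the queue itself shuts down
	requests  chan pushRequest // consumed by the queue's run loop
}

// Before the request is sent, producer shutdown is still a safe reason to
// give up. After it is sent, only queue shutdown can cancel the wait, so a
// successful insert is never reported as a failure.
func (p *producer) publish(req pushRequest) (uint64, bool) {
	select {
	case p.requests <- req:
	case <-p.done:
		return 0, false // request never reached the queue
	}
	select {
	case id := <-req.resp:
		return id, true
	case <-p.queueDone: // the queue shut down; the insert will never complete
		return 0, false
	}
}
```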