
Race in memqueue leads to panic: sync: negative WaitGroup counter #37702

Closed
michalpristas opened this issue Jan 23, 2024 · 11 comments · Fixed by #38094
Labels
Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@michalpristas
Contributor

michalpristas commented Jan 23, 2024

panic: sync: negative WaitGroup counter

goroutine 61 [running]:
sync.(*WaitGroup).Add(0xc00046c4e0?, 0xc001185000?)
	sync/waitgroup.go:62 +0xe5
sync.(*WaitGroup).Done(...)
	sync/waitgroup.go:87
github.com/elastic/beats/v7/filebeat/beater.(*eventCounter).Done(0x4bd3620?)
	github.com/elastic/beats/v7/filebeat/beater/channels.go:103 +0x45
github.com/elastic/beats/v7/filebeat/beater.(*finishedLogger).Published(0xc00019e240, 0x25)
	github.com/elastic/beats/v7/filebeat/beater/channels.go:89 +0x32
github.com/elastic/beats/v7/filebeat/beater.eventACKer.func1(0x4d0e414?, {0xc002cee720, 0x25, 0xa169b0?})
	github.com/elastic/beats/v7/filebeat/beater/acker.go:65 +0x422
github.com/elastic/beats/v7/libbeat/common/acker.(*eventDataACKer).onACK(0xc00100e280, 0xa3d200?, 0x25)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:257 +0x134
github.com/elastic/beats/v7/libbeat/common/acker.(*trackingACKer).ACKEvents(0xc001a6cf00, 0x25)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:206 +0x203
github.com/elastic/beats/v7/libbeat/common/acker.ackerList.ACKEvents(...)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:294
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).processACK.func1()
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/eventloop.go:515 +0x24
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).processACK(0xc000333170, {0x0, 0x0}, 0x25)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/eventloop.go:525 +0x188
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).handleBatchSig(0xc000ffd860)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/ackloop.go:73 +0xa4
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).run(0xc000ffd860)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/ackloop.go:52 +0x187
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue.func2()
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/broker.go:201 +0x65
created by github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/broker.go:199 +0x52d
@michalpristas michalpristas added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 23, 2024
@oshmyrko

We have the same issue after upgrading to Filebeat v8.11.3. Just reported it on discuss.elastic.co.

@kbujold

kbujold commented Jan 24, 2024

We are seeing this in v8.12.0 and also in v8.11.4. Probably related to changes that went into 8.11.3? https://www.elastic.co/guide/en/beats/libbeat/8.12/release-notes-8.11.3.html
57df903
#36990

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Jan 24, 2024
@faec
Contributor

faec commented Jan 24, 2024

Probably related to changes that went into 8.11.3

That changelog entry looks like a mistake in the docs backporting; I just confirmed in the source that those flag changes didn't take effect until 8.12.

The queue shows up in the stack trace, because the panic happens in an acknowledgment callback forwarded by the queue, but the actual panic is in Filebeat, which uses a WaitGroup to track all outstanding events in the pipeline. So it's not clear yet that the queue itself is part of the problem -- a bookkeeping error at any stage of the pipeline could cause this sort of failure. I'm investigating now.
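
For context, here is a minimal sketch (illustrative only, not the Filebeat source) of how that kind of bookkeeping error produces this exact panic: a WaitGroup-backed event counter panics as soon as any event is marked done more times than it was added.

package main

import "sync"

// eventCounter mirrors the idea of Filebeat's pipeline counter: Add when an
// event enters the pipeline, Done when it is acknowledged.
type eventCounter struct{ wg sync.WaitGroup }

func (c *eventCounter) Add(n int) { c.wg.Add(n) }
func (c *eventCounter) Done()     { c.wg.Done() }

func main() {
	var c eventCounter
	c.Add(1) // one event enters the pipeline
	c.Done() // first acknowledgment
	c.Done() // duplicate acknowledgment -> panic: sync: negative WaitGroup counter
}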

@faec
Copy link
Contributor

faec commented Jan 24, 2024

I'd be curious to know what input and output types people were using when they saw this error (cc @oshmyrko @kbujold)

@kbujold

kbujold commented Jan 24, 2024

@faec, we had no issues with 8.9.0.
This is our filebeat config:

cat filebeat.yml
fields:
  system:
    name: test
    uid: 22aafb31-1234-aaaa-2222-18a2db1f7ebb
fields_under_root: true
filebeat.autodiscover:
  providers:
  - add_resource_metadata:
      cronjob: false
      deployment: false
    hints.default_config:
      close.on_state_change.renamed: true
      id: filestream-kubernetes-pod-${data.kubernetes.container.id}
      parsers:
      - container: null
      paths:
      - /var/log/containers/*-${data.kubernetes.container.id}.log
      prospector.scanner.symlinks: true
      type: filestream
    hints.enabled: true
    host: ${NODE_NAME}
    type: kubernetes
filebeat.inputs:
- close.reader.after_interval: 5m
  id: wra-filestream-id
  paths:
  - /var/log/*.log
  - /var/log/messages
  - /var/log/syslog
  - /var/log/**/*.log
  prospector.scanner.exclude_files:
  - ^/var/log/containers/
  - ^/var/log/pods/
  type: filestream
http.port: 5066
logging.level: warning
monitoring:
  cluster_uuid: ${CLUSTER_UUID}
  elasticsearch:
    hosts:
    - https://mon-elasticsearch-client:9200
    password: ${beats_system_monitoring_password}
    ssl.certificate_authorities:
    - /usr/share/filebeat/ext-ca.crt
    username: ${beats_system_monitoring_user}
  enabled: ${BEAT_MONITORING_ENABLED}
name: ${NODE_NAME}
output.elasticsearch:
  enabled: false
  host: ${NODE_NAME}
  hosts:
  - https://mon-elasticsearch-client:9200
  ilm.pattern: '000001'
  index: ${INDEX_NAME}-%{+yyyy.MM.dd}
  password: ${ELASTICSEARCH_PASSWORD}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  username: ${ELASTICSEARCH_USERNAME}
output.file:
  enabled: false
output.logstash:
  enabled: true
  hosts:
  - mon-logstash:5044
  ssl.certificate: /usr/share/filebeat/config/instance/filebeat.crt
  ssl.certificate_authorities:
  - /usr/share/filebeat/ca.crt
  - /usr/share/filebeat/previous/ca.crt
  - /usr/share/filebeat/next/ca.crt
  ssl.key: /usr/share/filebeat/config/instance/filebeat.key
  timeout: 9
setup.dashboards:
  enabled: false
setup.kibana:
  host: mon-kibana:5601
  password: ${filebeat_password}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  ssl.verification_mode: none
  username: ${filebeat_user}
setup.template:
  name: ${INDEX_NAME}
  pattern: ${INDEX_PATTERN}

@kbujold

kbujold commented Jan 25, 2024

I am doing some testing and this issue appears to be in v8.12.0, v8.11.4, v8.11.3 and v8.11.2.

@faec
Contributor

faec commented Jan 26, 2024

I am doing some testing and this issue appears to be in v8.12.0, v8.11.4, v8.11.3 and v8.11.2.

@kbujold Thanks! (And thanks for the config!) Have you tried v8.11.1 or earlier? I haven't found any promising causes in 8.11.2 and I'm wondering how much I should broaden my search.

@kbujold

kbujold commented Jan 26, 2024

@faec, we are not seeing the issue in v8.11.1 so far. The intent was to update our product to the latest ELK release, so please keep us updated, as we cannot release with this bug.

@oshmyrko

oshmyrko commented Feb 1, 2024

@faec, this is our Filebeat config:

# cat /etc/filebeat/filebeat.yml 
filebeat.registry.file: /var/lib/filebeat/registry

# Monitoring http endpoint
http:
  enabled: true
  host: localhost
  port: 5066

filebeat.config.inputs:
  enabled: true
  path: inputs.d/*

setup:
  ilm.enabled: false
  template:
    enabled: false
    overwrite: false

logging.metrics.enabled: false

processors:
  - add_fields:
      target: ''
      fields:
        env: prod
  - add_cloud_metadata: ~
  - drop_fields:
      fields:
      - agent.ephemeral_id
      - agent.hostname
      - agent.id
      - agent.name
      - agent.type
      - cloud.account
      - cloud.image
      - cloud.machine
      - cloud.provider
      - cloud.region
      - cloud.service
      - ecs
      - input
      - input.type
      - log.file.path
      - log.offset
      - stream

output.elasticsearch:
  hosts: ["es.my-domain.com:9200"]
  compression_level: 1
  indices:
    - index: "syslog-%{[agent.version]}"
      when.and:
        - has_fields: ['type']
        - equals.type: 'syslog'
    - index: "kernel-%{[agent.version]}"
      when.and:
        - has_fields: [ 'type' ]
        - equals.type: 'kernel'
    - index: "cloud-init-%{[agent.version]}"
      when.and:
        - has_fields: ['type']
        - equals.type: 'cloud-init'
    - index: "my-app-%{[agent.version]}"
      when.not.and:
        - has_fields: ['type']
        - or:
          - equals.type: 'syslog'
          - equals.type: 'kernel'
          - equals.type: 'cloud-init'

# Increased "queue" and "bulk_max_size" reduced the number of "panic: sync: negative WaitGroup counter"
# on hosts with a high rate of logs, e.g. 1200 logs/sec and more
output.elasticsearch.bulk_max_size: 2400
output.elasticsearch.worker: 1
queue.mem.events: 4800
queue.mem.flush.min_events: 2400

input configs:

# cat /etc/filebeat/inputs.d/app-log.yml 
- type: filestream
  id: app-log
  paths:
    - /home/ubuntu/my-app/logs/*-json.log
  parsers:
    - ndjson:
        ignore_decoding_error: true
  processors:
    - timestamp:
        field: timestamp
        layouts:
          - '2006-01-02T15:04:05Z'
          - '2006-01-02T15:04:05.999Z'
          - '2006-01-02T15:04:05.999-07:00'
        test:
          - '2023-04-16T17:45:35.999Z'
        ignore_missing: true
    - drop_fields:
        fields: [timestamp]
        ignore_missing: true

---
# cat /etc/filebeat/inputs.d/docker.yml 
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json:
    keys_under_root: true
    ignore_decoding_error: true

---
# cat /etc/filebeat/inputs.d/syslog.yml 
- type: filestream
  id: syslog
  paths:
    - '/var/log/syslog*'
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'syslog'

---
# cat /etc/filebeat/inputs.d/kernel.yml 
- type: filestream
  id: kernel
  paths:
    - '/var/log/kern.log'
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'kernel'

---
# cat /etc/filebeat/inputs.d/cloud-init.yml 
- type: filestream
  id: cloud-init
  paths:
    - '/var/log/cloud-init-output.log'
  ignore_older: '2h'
  close_inactive: '10m'
  exclude_lines: ['^\|', '^\+']
  parsers:
    - multiline:
        type: pattern
        pattern: '^ci-info.*'
        negate: false
        match: after
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'cloud-init'

We didn't have this issue with Filebeat 8.6.1.

@erlandan

erlandan commented Feb 8, 2024

We have this issue with an 8.11.3 fleet-managed elastic-agent running the kubernetes integration.

@faec
Contributor

faec commented Feb 21, 2024

I believe I've found the cause. It's a subtle side effect of #37077, which is supposed to keep producers from getting stuck when the queue shuts down. However, as written, the cancellation is triggered on shutdown of the queue producer rather than the queue itself. This means that if a producer is closed while it is waiting for a response from the queue, it can return failure even though the queue insert was successful (just incomplete). This makes it possible for that event to receive two "done" signals, one from the cancelled insert, and one when it is correctly acknowledged upstream. I'm preparing a fix.
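
A rough sketch of the race described above (the type and field names are illustrative assumptions, not the actual memqueue code): the producer sends its insert request and then waits for the queue's answer, but before the fix it aborted that wait on its own done channel, so closing the producer mid-insert could report failure for an event the queue had in fact accepted.

// Illustrative sketch only; names are assumptions, not the actual memqueue types.
package memqueuesketch

type pushRequest struct {
	responseChan chan bool // true if the queue accepted the event
}

type producer struct {
	requestChan  chan pushRequest // consumed by the queue's event loop
	producerDone chan struct{}    // closed when this producer is closed
}

// publish sends an insert request and waits for the queue's answer.
func (p *producer) publish(req pushRequest) bool {
	// Send the insert request to the queue.
	select {
	case p.requestChan <- req:
	case <-p.producerDone:
		return false // request never reached the queue, safe to report failure
	}

	// Wait for the response.
	select {
	case accepted := <-req.responseChan:
		return accepted
	case <-p.producerDone:
		// BUG: the producer was closed, but the queue may still complete the
		// insert and acknowledge the event later. The caller treats the event
		// as failed while the pipeline also counts a successful ack, so the
		// event effectively receives two "done" signals.
		return false
	}
}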

faec added a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.
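
Continuing the sketch above (same assumed types, not the exact change in #38094), the fix amounts to giving the producer the queue's shutdown signal and waiting on that once the request has been sent:

type producer struct {
	requestChan  chan pushRequest
	producerDone chan struct{} // closed when this producer is closed
	queueDone    chan struct{} // closed when the queue itself shuts down
}

func (p *producer) publish(req pushRequest) bool {
	select {
	case p.requestChan <- req:
	case <-p.producerDone:
		return false // request never reached the queue
	}

	// The queue now owns the request; only a queue shutdown can legitimately
	// cancel the wait, so block on the queue's done signal, not the producer's.
	select {
	case accepted := <-req.responseChan:
		return accepted
	case <-p.queueDone:
		return false
	}
}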
mergify bot pushed a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
mergify bot pushed a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
faec added a commit that referenced this issue Mar 5, 2024
… closed (#38094) (#38178)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

Co-authored-by: Fae Charlton <[email protected]>
mergify bot pushed a commit that referenced this issue Mar 12, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
faec added a commit that referenced this issue Mar 12, 2024
…eue closed, not producer closed (#38177)

* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

* fix backport

---------

Co-authored-by: Fae Charlton <[email protected]>
faec added a commit that referenced this issue Mar 12, 2024
…eue closed, not producer closed (#38279)

* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

* fix backport

* fix NewQueue call

---------

Co-authored-by: Fae Charlton <[email protected]>