
Race in memqueue leads to panic: sync: negative WaitGroup counter #37702

Closed
michalpristas opened this issue Jan 23, 2024 · 11 comments · Fixed by #38094
Labels
Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@michalpristas
Contributor

michalpristas commented Jan 23, 2024

panic: sync: negative WaitGroup counter

goroutine 61 [running]:
sync.(*WaitGroup).Add(0xc00046c4e0?, 0xc001185000?)
	sync/waitgroup.go:62 +0xe5
sync.(*WaitGroup).Done(...)
	sync/waitgroup.go:87
github.com/elastic/beats/v7/filebeat/beater.(*eventCounter).Done(0x4bd3620?)
	github.com/elastic/beats/v7/filebeat/beater/channels.go:103 +0x45
github.com/elastic/beats/v7/filebeat/beater.(*finishedLogger).Published(0xc00019e240, 0x25)
	github.com/elastic/beats/v7/filebeat/beater/channels.go:89 +0x32
github.com/elastic/beats/v7/filebeat/beater.eventACKer.func1(0x4d0e414?, {0xc002cee720, 0x25, 0xa169b0?})
	github.com/elastic/beats/v7/filebeat/beater/acker.go:65 +0x422
github.com/elastic/beats/v7/libbeat/common/acker.(*eventDataACKer).onACK(0xc00100e280, 0xa3d200?, 0x25)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:257 +0x134
github.com/elastic/beats/v7/libbeat/common/acker.(*trackingACKer).ACKEvents(0xc001a6cf00, 0x25)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:206 +0x203
github.com/elastic/beats/v7/libbeat/common/acker.ackerList.ACKEvents(...)
	github.com/elastic/beats/v7/libbeat/common/acker/acker.go:294
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).processACK.func1()
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/eventloop.go:515 +0x24
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).processACK(0xc000333170, {0x0, 0x0}, 0x25)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/eventloop.go:525 +0x188
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).handleBatchSig(0xc000ffd860)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/ackloop.go:73 +0xa4
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).run(0xc000ffd860)
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/ackloop.go:52 +0x187
github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue.func2()
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/broker.go:201 +0x65
created by github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.NewQueue
	github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue/broker.go:199 +0x52d
@michalpristas michalpristas added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 23, 2024
@oshmyrko

We have the same issue after upgrading to Filebeat v8.11.3. Just reported it on discuss.elastic.co.

@kbujold

kbujold commented Jan 24, 2024

We are seeing this in v8.12.0 and also in v8.11.4. Probably related to changes that went into 8.11.3? https://www.elastic.co/guide/en/beats/libbeat/8.12/release-notes-8.11.3.html
57df903
#36990

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Jan 24, 2024
@faec
Contributor

faec commented Jan 24, 2024

Probably related to changes that went into 8.11.3

That changelog entry looks like a mistake in the docs backporting; I just confirmed in the source that those flag changes didn't take effect until 8.12.

The queue shows up in the stack trace, because the panic happens in an acknowledgment callback forwarded by the queue, but the actual panic is in Filebeat, which uses a WaitGroup to track all outstanding events in the pipeline. So it's not clear yet that the queue itself is part of the problem -- a bookkeeping error at any stage of the pipeline could cause this sort of failure. I'm investigating now.
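
For context, here is a minimal sketch (illustrative only, not the Filebeat source) of how that kind of bookkeeping error produces this exact panic: a WaitGroup-backed event counter panics as soon as any event is marked done more times than it was added.

package main

import "sync"

// eventCounter mirrors the idea of Filebeat's pipeline counter: Add when an
// event enters the pipeline, Done when it is acknowledged.
type eventCounter struct{ wg sync.WaitGroup }

func (c *eventCounter) Add(n int) { c.wg.Add(n) }
func (c *eventCounter) Done()     { c.wg.Done() }

func main() {
	var c eventCounter
	c.Add(1) // one event enters the pipeline
	c.Done() // first acknowledgment
	c.Done() // duplicate acknowledgment -> panic: sync: negative WaitGroup counter
}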

@faec
Copy link
Contributor

faec commented Jan 24, 2024

I'd be curious to know what input and output types people were using when they saw this error (cc @oshmyrko @kbujold)

@kbujold

kbujold commented Jan 24, 2024

@faec, we had no issues with 8.9.0.
This is our filebeat config:

cat filebeat.yml
fields:
  system:
    name: test
    uid: 22aafb31-1234-aaaa-2222-18a2db1f7ebb
fields_under_root: true
filebeat.autodiscover:
  providers:
  - add_resource_metadata:
      cronjob: false
      deployment: false
    hints.default_config:
      close.on_state_change.renamed: true
      id: filestream-kubernetes-pod-${data.kubernetes.container.id}
      parsers:
      - container: null
      paths:
      - /var/log/containers/*-${data.kubernetes.container.id}.log
      prospector.scanner.symlinks: true
      type: filestream
    hints.enabled: true
    host: ${NODE_NAME}
    type: kubernetes
filebeat.inputs:
- close.reader.after_interval: 5m
  id: wra-filestream-id
  paths:
  - /var/log/*.log
  - /var/log/messages
  - /var/log/syslog
  - /var/log/**/*.log
  prospector.scanner.exclude_files:
  - ^/var/log/containers/
  - ^/var/log/pods/
  type: filestream
http.port: 5066
logging.level: warning
monitoring:
  cluster_uuid: ${CLUSTER_UUID}
  elasticsearch:
    hosts:
    - https://mon-elasticsearch-client:9200
    password: ${beats_system_monitoring_password}
    ssl.certificate_authorities:
    - /usr/share/filebeat/ext-ca.crt
    username: ${beats_system_monitoring_user}
  enabled: ${BEAT_MONITORING_ENABLED}
name: ${NODE_NAME}
output.elasticsearch:
  enabled: false
  host: ${NODE_NAME}
  hosts:
  - https://mon-elasticsearch-client:9200
  ilm.pattern: '000001'
  index: ${INDEX_NAME}-%{+yyyy.MM.dd}
  password: ${ELASTICSEARCH_PASSWORD}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  username: ${ELASTICSEARCH_USERNAME}
output.file:
  enabled: false
output.logstash:
  enabled: true
  hosts:
  - mon-logstash:5044
  ssl.certificate: /usr/share/filebeat/config/instance/filebeat.crt
  ssl.certificate_authorities:
  - /usr/share/filebeat/ca.crt
  - /usr/share/filebeat/previous/ca.crt
  - /usr/share/filebeat/next/ca.crt
  ssl.key: /usr/share/filebeat/config/instance/filebeat.key
  timeout: 9
setup.dashboards:
  enabled: false
setup.kibana:
  host: mon-kibana:5601
  password: ${filebeat_password}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  ssl.verification_mode: none
  username: ${filebeat_user}
setup.template:
  name: ${INDEX_NAME}
  pattern: ${INDEX_PATTERN}

@kbujold

kbujold commented Jan 25, 2024

I am doing some testing and this issue appears to be in v8.12.0, v8.11.4, v8.11.3 and v8.11.2.

@faec
Contributor

faec commented Jan 26, 2024

I am doing some testing and this issue appears to be in v8.12.0, v8.11.4, v8.11.3 and v8.11.2.

@kbujold Thanks! (And thanks for the config!) Have you tried v8.11.1 or earlier? I haven't found any promising causes in 8.11.2 and I'm wondering how much I should broaden my search.

@kbujold

kbujold commented Jan 26, 2024

@faec, we are not seeing the issue in v8.11.1 so far. The intent was to update our product to the latest ELK release, so please keep us updated, as we cannot release with this bug.

@oshmyrko

oshmyrko commented Feb 1, 2024

@faec, this is our Filebeat config:

# cat /etc/filebeat/filebeat.yml 
filebeat.registry.file: /var/lib/filebeat/registry

# Monitoring http endpoint
http:
  enabled: true
  host: localhost
  port: 5066

filebeat.config.inputs:
  enabled: true
  path: inputs.d/*

setup:
  ilm.enabled: false
  template:
    enabled: false
    overwrite: false

logging.metrics.enabled: false

processors:
  - add_fields:
      target: ''
      fields:
        env: prod
  - add_cloud_metadata: ~
  - drop_fields:
      fields:
      - agent.ephemeral_id
      - agent.hostname
      - agent.id
      - agent.name
      - agent.type
      - cloud.account
      - cloud.image
      - cloud.machine
      - cloud.provider
      - cloud.region
      - cloud.service
      - ecs
      - input
      - input.type
      - log.file.path
      - log.offset
      - stream

output.elasticsearch:
  hosts: ["es.my-domain.com:9200"]
  compression_level: 1
  indices:
    - index: "syslog-%{[agent.version]}"
      when.and:
        - has_fields: ['type']
        - equals.type: 'syslog'
    - index: "kernel-%{[agent.version]}"
      when.and:
        - has_fields: [ 'type' ]
        - equals.type: 'kernel'
    - index: "cloud-init-%{[agent.version]}"
      when.and:
        - has_fields: ['type']
        - equals.type: 'cloud-init'
    - index: "my-app-%{[agent.version]}"
      when.not.and:
        - has_fields: ['type']
        - or:
          - equals.type: 'syslog'
          - equals.type: 'kernel'
          - equals.type: 'cloud-init'

# Increased "queue" and "bulk_max_size" reduced the number of "panic: sync: negative WaitGroup counter"
# on hosts with a high rate of logs, e.g. 1200 logs/sec and more
output.elasticsearch.bulk_max_size: 2400
output.elasticsearch.worker: 1
queue.mem.events: 4800
queue.mem.flush.min_events: 2400

input configs:

# cat /etc/filebeat/inputs.d/app-log.yml 
- type: filestream
  id: app-log
  paths:
    - /home/ubuntu/my-app/logs/*-json.log
  parsers:
    - ndjson:
        ignore_decoding_error: true
  processors:
    - timestamp:
        field: timestamp
        layouts:
          - '2006-01-02T15:04:05Z'
          - '2006-01-02T15:04:05.999Z'
          - '2006-01-02T15:04:05.999-07:00'
        test:
          - '2023-04-16T17:45:35.999Z'
        ignore_missing: true
    - drop_fields:
        fields: [timestamp]
        ignore_missing: true

---
# cat /etc/filebeat/inputs.d/docker.yml 
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json:
    keys_under_root: true
    ignore_decoding_error: true

---
# cat /etc/filebeat/inputs.d/syslog.yml 
- type: filestream
  id: syslog
  paths:
    - '/var/log/syslog*'
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'syslog'

---
# cat /etc/filebeat/inputs.d/kernel.yml 
- type: filestream
  id: kernel
  paths:
    - '/var/log/kern.log'
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'kernel'

---
# cat /etc/filebeat/inputs.d/cloud-init.yml 
- type: filestream
  id: cloud-init
  paths:
    - '/var/log/cloud-init-output.log'
  ignore_older: '2h'
  close_inactive: '10m'
  exclude_lines: ['^\|', '^\+']
  parsers:
    - multiline:
        type: pattern
        pattern: '^ci-info.*'
        negate: false
        match: after
  processors:
    - add_fields:
        target: ''
        fields:
          type: 'cloud-init'

We didn't have this issue with Filebeat 8.6.1.

@erlandan

erlandan commented Feb 8, 2024

We have this issue with an 8.11.3 fleet-managed elastic-agent running the kubernetes integration.

@faec
Contributor

faec commented Feb 21, 2024

I believe I've found the cause. It's a subtle side effect of #37077, which is supposed to keep producers from getting stuck when the queue shuts down. However, as written, the cancellation is triggered on shutdown of the queue producer rather than the queue itself. This means that if a producer is closed while it is waiting for a response from the queue, it can return failure even though the queue insert was successful (just incomplete). This makes it possible for that event to receive two "done" signals, one from the cancelled insert, and one when it is correctly acknowledged upstream. I'm preparing a fix.
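
A rough sketch of the race described above (the type and field names are illustrative assumptions, not the actual memqueue code): the producer sends its insert request and then waits for the queue's answer, but before the fix it aborted that wait on its own done channel, so closing the producer mid-insert could report failure for an event the queue had in fact accepted.

// Illustrative sketch only; names are assumptions, not the actual memqueue types.
package memqueuesketch

type pushRequest struct {
	responseChan chan bool // true if the queue accepted the event
}

type producer struct {
	requestChan  chan pushRequest // consumed by the queue's event loop
	producerDone chan struct{}    // closed when this producer is closed
}

// publish sends an insert request and waits for the queue's answer.
func (p *producer) publish(req pushRequest) bool {
	// Send the insert request to the queue.
	select {
	case p.requestChan <- req:
	case <-p.producerDone:
		return false // request never reached the queue, safe to report failure
	}

	// Wait for the response.
	select {
	case accepted := <-req.responseChan:
		return accepted
	case <-p.producerDone:
		// BUG: the producer was closed, but the queue may still complete the
		// insert and acknowledge the event later. The caller treats the event
		// as failed while the pipeline also counts a successful ack, so the
		// event effectively receives two "done" signals.
		return false
	}
}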

faec added a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.
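
Continuing the sketch above (same assumed types, not the exact change in #38094), the fix amounts to giving the producer the queue's shutdown signal and waiting on that once the request has been sent:

type producer struct {
	requestChan  chan pushRequest
	producerDone chan struct{} // closed when this producer is closed
	queueDone    chan struct{} // closed when the queue itself shuts down
}

func (p *producer) publish(req pushRequest) bool {
	select {
	case p.requestChan <- req:
	case <-p.producerDone:
		return false // request never reached the queue
	}

	// The queue now owns the request; only a queue shutdown can legitimately
	// cancel the wait, so block on the queue's done signal, not the producer's.
	select {
	case accepted := <-req.responseChan:
		return accepted
	case <-p.queueDone:
		return false
	}
}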
mergify bot pushed a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
mergify bot pushed a commit that referenced this issue Mar 4, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
faec added a commit that referenced this issue Mar 5, 2024
… closed (#38094) (#38178)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

Co-authored-by: Fae Charlton <[email protected]>
mergify bot pushed a commit that referenced this issue Mar 12, 2024
… closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)
faec added a commit that referenced this issue Mar 12, 2024
…eue closed, not producer closed (#38177)

* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

* fix backport

---------

Co-authored-by: Fae Charlton <[email protected]>
faec added a commit that referenced this issue Mar 12, 2024
…eue closed, not producer closed (#38279)

* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094)

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.

(cherry picked from commit d23b4d3)

* fix backport

* fix NewQueue call

---------

Co-authored-by: Fae Charlton <[email protected]>