[pkg/stanza] For recombine operator, in some cases, the waiting time is much longer than forceFlushTimeout #20451

Closed
yutingcaicyt opened this issue Mar 29, 2023 · 6 comments · Fixed by #20742

@yutingcaicyt
Contributor

Component(s)

pkg/stanza

What happened?

Description

When I collect logs from the stdout of a pod that may contain several kinds of logs, some logs start with a fixed format and some don't. In this case, I sometimes find that the aggregation time in the recombine operator is much longer than forceFlushTimeout.

Steps to Reproduce

config:

```yaml
- type: recombine
  combine_field: body
  force_flush_period: 5s
  is_first_entry: body matches "^[0-9]+-[0-9]+-[0-9]+"
  source_identifier: attributes["log.file.path"]
```

Send the first log; it starts with the fixed format and matches is_first_entry:

2023-03-29 xxxxxxxxxxfirst-logxxxxxxxxxx

Wait 4s, then send the second log; it does not start with the fixed format (its source may be different from the first log's, but all of them are printed to stdout):

xxxxxxxxxxsecond-logxxxxxxxxxx

Wait 4s, then send the third log:

xxxxxxxxxxthird-logxxxxxxxxxx

Wait 4s, then send the fourth log:

xxxxxxxxxx4th-logxxxxxxxxxx

Continue sending a log like this every 4 seconds.
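As the root-cause discussion further down explains, the flush check in this version compares the timeout against the time since the *last* entry, so under this cadence the flush never fires. A minimal, runnable Go sketch of the timing (names here are illustrative, not the actual pkg/stanza code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	forceFlushPeriod := 5 * time.Second
	interval := 4 * time.Second // a new log line arrives every 4s

	firstEntry := time.Now()
	lastEntry := firstEntry
	for i := 0; i < 3; i++ {
		now := lastEntry.Add(interval) // the next entry arrives
		// Check used in v0.73.0: the gap since the last entry is always
		// 4s < 5s, so the batch is never force-flushed.
		fmt.Println("since last entry:", now.Sub(lastEntry), "flush:", now.Sub(lastEntry) > forceFlushPeriod)
		// The batch's total age grows without bound, so a first-entry
		// check would eventually flush it.
		fmt.Println("since first entry:", now.Sub(firstEntry), "flush:", now.Sub(firstEntry) > forceFlushPeriod)
		lastEntry = now
	}
}
```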

Expected Result

The logs should be flushed within 5s (the value of forceFlushTimeout).

Actual Result

These logs stay in the recombine operator forever.

Collector version

v0.73.0

Environment information

any environment

OpenTelemetry Collector configuration

```yaml
exporters:
  otlphttp:
    logs_endpoint: xxxx
extensions:
  health_check: {}
  memory_ballast:
    size_mib: "1024"
processors:
  batch:
    send_batch_size: 1000
  memory_limiter:
    check_interval: 5s
    limit_percentage: 50
    spike_limit_percentage: 30
  k8sattributes:
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    auth_type: "serviceAccount"
    extract:
      metadata:
      - k8s.pod.uid
      - k8s.node.name
      - k8s.pod.name
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
receivers:
  filelog:
    exclude: []
    include:
    - /var/log/pods/test_*/http-server/*.log
    include_file_path: true
    include_file_name: false
    fingerprint_size: 19kb
    operators:
    - type: json_parser
      parse_from: body
    - parse_from: attributes["log.file.path"]
      regex: ^.*_.*_(?P<uid>[a-f0-9\-]+).*\.log$
      type: regex_parser
    - from: attributes.uid
      to: resource["k8s.pod.uid"]
      type: move
    - type: move
      from: attributes.log
      to: body
    - type: remove
      field: attributes.time
    - type: remove
      field: attributes.stream
    - type: recombine
      combine_field: body
      is_first_entry: body matches "^[0-9]+-[0-9]+-[0-9]+"
      source_identifier: attributes["log.file.path"]
    start_at: beginning
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    logs:
      exporters:
      - otlphttp
      processors:
      - memory_limiter
      - k8sattributes
      - batch
      receivers:
      - filelog
```

Log output

No response

Additional context

No response

@yutingcaicyt yutingcaicyt added bug Something isn't working needs triage New item requiring triage labels Mar 29, 2023
@yutingcaicyt
Contributor Author

@djaglowski

@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme atoulme removed the needs triage New item requiring triage label Mar 29, 2023
@djaglowski
Member

@yutingcaicyt, are you willing to look into this? Let me know and I will assign it.

@djaglowski
Member

Possibly related to #18089

@yutingcaicyt
Contributor Author

> @yutingcaicyt, are you willing to look into this? Let me know and I will assign it.

Yeah, I'm willing to resolve it. Please assign it to me, thanks.

@yutingcaicyt
Contributor Author

> Possibly related to #18089

It's not related to #18089. I found that the description of force_flush_period does not match its actual behavior. The problem is that forceFlushTimeout is compared with timeSinceLastEntry rather than timeSinceFirstEntry in the flushLoop. I think comparing with timeSinceFirstEntry is more reasonable, because the forceFlushTimeout period is normally more than enough time to merge the entries that belong together.
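A minimal sketch of the difference, assuming a flushLoop of roughly this shape (type and field names are illustrative, not copied from the actual pkg/stanza code):

```go
package recombine

import "time"

// batch groups the partial entries collected so far for one source.
type batch struct {
	firstEntryObservedTime time.Time // set when the batch is created
	lastEntryObservedTime  time.Time // reset on every appended entry
}

// flushLoop periodically force-flushes batches that have waited too long.
func flushLoop(batches map[string]*batch, forceFlushTimeout time.Duration, flush func(source string)) {
	ticker := time.NewTicker(forceFlushTimeout)
	defer ticker.Stop()
	for range ticker.C {
		for source, b := range batches {
			// v0.73.0 check: a steady stream of entries keeps resetting
			// lastEntryObservedTime, so this condition may never fire:
			//   if time.Since(b.lastEntryObservedTime) > forceFlushTimeout { ... }
			//
			// Proposed check: firstEntryObservedTime only resets on flush,
			// so every batch is force-flushed eventually.
			if time.Since(b.firstEntryObservedTime) > forceFlushTimeout {
				flush(source)
			}
		}
	}
}
```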

djaglowski pushed a commit that referenced this issue Apr 12, 2023
When using the recombine operator, the actual behavior differs from the documented behavior of force_flush_period: the actual timeout can be much longer than force_flush_period. This change fixes that.

To keep the actual timeout close to force_flush_period, the ticker period is set to 1/5 of force_flush_period, so entries are force-flushed after waiting at most 6/5 of force_flush_period.
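Working through the bound stated above (the 1/5 tick period is taken from the commit message; the rest is an assumed worst case): the age check only runs on tick boundaries, so a batch whose age crosses the threshold just after a tick waits one more tick.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	forceFlushPeriod := 5 * time.Second
	tick := forceFlushPeriod / 5 // ticker period after the fix: 1s

	// Worst case: the batch's age exceeds forceFlushPeriod immediately
	// after a tick, so it is only flushed on the next tick.
	worstCase := forceFlushPeriod + tick
	fmt.Println(worstCase) // 6s, i.e. 6/5 of force_flush_period
}
```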