[pkg/stanza] For recombine operator, in some cases, the waiting time is much longer than forceFlushTimeout #20451

Closed
yutingcaicyt opened this issue Mar 29, 2023 · 6 comments · Fixed by #20742

@yutingcaicyt
Contributor

Component(s)

pkg/stanza

What happened?

Description

When I collect logs from the stdout of a pod that may contain several kinds of logs, some logs start with a fixed format and some don't. In this case, I sometimes find that the aggregation time in the recombine operator is much longer than forceFlushTimeout.

Steps to Reproduce

config:

```yaml
- type: recombine
  combine_field: body
  force_flush_period: 5s
  is_first_entry: body matches "^[0-9]+-[0-9]+-[0-9]+"
  source_identifier: attributes["log.file.path"]
```

Send the first log; it starts with the fixed format and matches is_first_entry:

2023-03-29 xxxxxxxxxxfirst-logxxxxxxxxxx

Wait 4s, then send the second log; it does not start with the fixed format (its source may be different from the first log's, but all of them are printed to stdout):

xxxxxxxxxxsecond-logxxxxxxxxxx

Wait 4s, then send the third log:

xxxxxxxxxxthird-logxxxxxxxxxx

Wait 4s, then send the fourth log:

xxxxxxxxxx4th-logxxxxxxxxxx

Continue sending a log like this every 4 seconds.
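As the root-cause discussion further down explains, the flush check in this version compares the timeout against the time since the *last* entry, so under this cadence the flush never fires. A minimal, runnable Go sketch of the timing (names here are illustrative, not the actual pkg/stanza code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	forceFlushPeriod := 5 * time.Second
	interval := 4 * time.Second // a new log line arrives every 4s

	firstEntry := time.Now()
	lastEntry := firstEntry
	for i := 0; i < 3; i++ {
		now := lastEntry.Add(interval) // the next entry arrives
		// Check used in v0.73.0: the gap since the last entry is always
		// 4s < 5s, so the batch is never force-flushed.
		fmt.Println("since last entry:", now.Sub(lastEntry), "flush:", now.Sub(lastEntry) > forceFlushPeriod)
		// The batch's total age grows without bound, so a first-entry
		// check would eventually flush it.
		fmt.Println("since first entry:", now.Sub(firstEntry), "flush:", now.Sub(firstEntry) > forceFlushPeriod)
		lastEntry = now
	}
}
```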

Expected Result

The logs should be flushed within 5s (the value of forceFlushTimeout).

Actual Result

These logs stay in the recombine operator forever.

Collector version

v0.73.0

Environment information

any environment

OpenTelemetry Collector configuration

```yaml
exporters:
  otlphttp:
    logs_endpoint: xxxx
extensions:
  health_check: {}
  memory_ballast:
    size_mib: "1024"
processors:
  batch:
    send_batch_size: 1000
  memory_limiter:
    check_interval: 5s
    limit_percentage: 50
    spike_limit_percentage: 30
  k8sattributes:
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    auth_type: "serviceAccount"
    extract:
      metadata:
      - k8s.pod.uid
      - k8s.node.name
      - k8s.pod.name
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
receivers:
  filelog:
    exclude: []
    include:
    - /var/log/pods/test_*/http-server/*.log
    include_file_path: true
    include_file_name: false
    fingerprint_size: 19kb
    operators:
    - type: json_parser
      parse_from: body
    - parse_from: attributes["log.file.path"]
      regex: ^.*_.*_(?P<uid>[a-f0-9\-]+).*\.log$
      type: regex_parser
    - from: attributes.uid
      to: resource["k8s.pod.uid"]
      type: move
    - type: move
      from: attributes.log
      to: body
    - type: remove
      field: attributes.time
    - type: remove
      field: attributes.stream
    - type: recombine
      combine_field: body
      is_first_entry: body matches "^[0-9]+-[0-9]+-[0-9]+"
      source_identifier: attributes["log.file.path"]
    start_at: beginning
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    logs:
      exporters:
      - otlphttp
      processors:
      - memory_limiter
      - k8sattributes
      - batch
      receivers:
      - filelog
```

Log output

No response

Additional context

No response

@yutingcaicyt yutingcaicyt added bug Something isn't working needs triage New item requiring triage labels Mar 29, 2023
@yutingcaicyt
Contributor Author

@djaglowski

@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme atoulme removed the needs triage New item requiring triage label Mar 29, 2023
@djaglowski
Member

@yutingcaicyt, are you willing to look into this? Let me know and I will assign it.

@djaglowski
Member

Possibly related to #18089

@yutingcaicyt
Contributor Author

> @yutingcaicyt, are you willing to look into this? Let me know and I will assign it.

Yeah, I'm willing to resolve it. Please assign it to me, thanks.

@yutingcaicyt
Contributor Author

> Possibly related to #18089

It's not related to #18089. I found that the description of force_flush_period does not match its actual behavior. The problem is that forceFlushTimeout is compared with timeSinceLastEntry rather than timeSinceFirstEntry in the flushLoop. I think comparing with timeSinceFirstEntry is more reasonable, because the forceFlushTimeout period is normally more than enough time to merge the entries that belong together.
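A minimal sketch of the difference, assuming a flushLoop of roughly this shape (type and field names are illustrative, not copied from the actual pkg/stanza code):

```go
package recombine

import "time"

// batch groups the partial entries collected so far for one source.
type batch struct {
	firstEntryObservedTime time.Time // set when the batch is created
	lastEntryObservedTime  time.Time // reset on every appended entry
}

// flushLoop periodically force-flushes batches that have waited too long.
func flushLoop(batches map[string]*batch, forceFlushTimeout time.Duration, flush func(source string)) {
	ticker := time.NewTicker(forceFlushTimeout)
	defer ticker.Stop()
	for range ticker.C {
		for source, b := range batches {
			// v0.73.0 check: a steady stream of entries keeps resetting
			// lastEntryObservedTime, so this condition may never fire:
			//   if time.Since(b.lastEntryObservedTime) > forceFlushTimeout { ... }
			//
			// Proposed check: firstEntryObservedTime only resets on flush,
			// so every batch is force-flushed eventually.
			if time.Since(b.firstEntryObservedTime) > forceFlushTimeout {
				flush(source)
			}
		}
	}
}
```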

djaglowski pushed a commit that referenced this issue Apr 12, 2023
When using the recombine operator, the actual behavior differs from the documented behavior of force_flush_period: the actual timeout can be much longer than force_flush_period. This change fixes that.

To keep the actual timeout close to force_flush_period, the ticker period is set to 1/5 of force_flush_period, so entries are force-flushed after waiting at most 6/5 of force_flush_period.
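Working through the bound stated above (the 1/5 tick period is taken from the commit message; the rest is an assumed worst case): the age check only runs on tick boundaries, so a batch whose age crosses the threshold just after a tick waits one more tick.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	forceFlushPeriod := 5 * time.Second
	tick := forceFlushPeriod / 5 // ticker period after the fix: 1s

	// Worst case: the batch's age exceeds forceFlushPeriod immediately
	// after a tick, so it is only flushed on the next tick.
	worstCase := forceFlushPeriod + tick
	fmt.Println(worstCase) // 6s, i.e. 6/5 of force_flush_period
}
```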