Performance and dropped resources #28900

Closed · yanamg7 opened this issue Nov 6, 2023 · 10 comments

yanamg7 commented Nov 6, 2023

Component(s)

No response

What happened?

Description

We conducted a POC of different routing methods. When we used routing connectors, we saw resources being dropped and memory usage increase.
We sent 15K resources per second (logs and metrics) through the otlp and filelog receivers. When we ran the same traffic through multiple receivers instead (an otlp receiver on several ports, filelog receivers with different file name matching), the traffic was processed without errors. I initially ran the test without profiling, but after seeing the errors I enabled profiling.

Steps to Reproduce

Run traffic at 15K resources per second.

Expected Result

No dropped resources

Actual Result

Dropped resources and errors, along with increased memory and CPU usage

Collector version

otelcol-contrib version 0.88.0

Environment information

Environment

Ubuntu 22.04.2 LTS
pod resources:
  limits:
    memory: 3Gi
  requests:
    memory: 500Mi

OpenTelemetry Collector configuration

---
extensions:
  zpages: null
  pprof/1:
        endpoint: ":1777"
        block_profile_fraction: 3
        mutex_profile_fraction: 5
        save_to_file: "/prof_file.ppf"
receivers:
  filelog:
    exclude: []
    include:
      - /TODA/bench_log/*.log
    include_file_name: true
    include_file_path: true
    operators:
      - field: attributes.log
        type: add
        value: unspecified
      - field: attributes.ct
        type: add
        value: unspecified
      - field: attributes.scid
        type: add
        value: unspecified
      - field: attributes.sysid
        type: add
        value: unspecified
      - field: attributes.sev
        type: add
        value: unspecified
      - field: attributes.ts
        type: add
        value: 1970-01-01 00:00:00.000000 UTC
      - field: attributes.mid
        type: add
        value: unspecified
      - type: json_parser
      - mapping:
          fatal: emergency
          info4: notice
        parse_from: attributes.sev
        type: severity_parser
      - from: attributes.log
        id: move1
        to: resource["Log Message"]
        type: move
      - from: attributes.ct
        id: move2
        to: resource["Container Name"]
        type: move
      - from: attributes.scid
        id: move3
        to: resource["Unique Container ID"]
        type: move
      - from: attributes.sysid
        id: move4
        to: resource["System ID"]
        type: move
      - from: attributes.sev
        id: move5
        to: resource["Severity"]
        type: move
      - from: attributes.ts
        id: move6
        to: resource["Timestamp"]
        type: move
      - from: attributes.mid
        id: move7
        to: resource["Message ID"]
        type: move  

    resource:
      - dataType: bigipn-pods-log
    start_at: beginning
    
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4617
        tls:
          cert_file: ${OTEL_TLS_CERT_FILE}
          client_ca_file: ${OTEL_TLS_CA_FILE}
          key_file: ${OTEL_TLS_KEY_FILE}
          min_version: "1.2"
          reload_interval: 100000
  
processors:
  batch: null

exporters:
    file:
      path: /var/log/otlp.json
    logging:
      sampling_initial: 0
      sampling_thereafter: 0
      verbosity: basic
    otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
        endpoint: 10.144.74.39:4417
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/ccb8b939-5dab-465f-bcf4-921385d8001d:
        endpoint: 10.144.74.39:4517
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea:
        endpoint: 10.144.74.39:4317
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
connectors:
  routing/logs:
      default_pipelines: [logs/otlp-all]
      error_mode: propagate
      table:
        - statement: route() where IsMatch(attributes["FileName"], ".*access.*log")
          pipelines: [logs/access]
        - statement: route() where IsMatch(attributes["FileName"], ".*sec.*log")
          pipelines: [logs/sec]
        - statement: route() where not IsMatch(attributes["FileName"], ".*sec.*log") and not IsMatch(attributes["FileName"], ".*access.*log") 
          pipelines: [logs/container]
  routing/metrics:
      default_pipelines: [metrics/1]
      error_mode: propagate
      table:
        - statement: route() where attributes["dataType"] == "type1"
          pipelines: [metrics/1]
        - statement: route() where attributes["dataType"] == "type2"
          pipelines: [metrics/2]
        - statement: route() where attributes["dataType"] == "type3"
          pipelines: [metrics/3]
service:
  pipelines:
    logs/in:
      receivers: [filelog]
      exporters: [routing/logs]
    metrics/in:
      receivers: [otlp]
      exporters: [routing/metrics] 
    logs/otlp-all:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/logs
    logs/access:
      exporters:
        - logging
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      receivers:
        - routing/logs
    logs/container:
      exporters:
        - logging
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
      receivers:
        - routing/logs
    logs/sec:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/logs
    metrics/1:
      exporters:
        - logging
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
      receivers:
        - routing/metrics
    metrics/2:
      exporters:
        - logging
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      receivers:
        - routing/metrics
    metrics/3:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/metrics 
  extensions:
    - zpages
    - pprof/1

Log output

2023-11-06T11:09:53.092Z	error	exporterhelper/queue_sender.go:184	Dropping data because sending_queue is full. Try increasing queue_size.	{"kind": "exporter", "data_type": "logs", "name": "otlp/2c27f38b-3399-4531-b66c-f557f646ebb0", "dropped_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).send
	go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:184
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
	go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/common.go:196
go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func1
	go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/logs.go:100
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
	go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
go.opentelemetry.io/collector/internal/fanoutconsumer.(*logsConsumer).ConsumeLogs
	go.opentelemetry.io/collector@v0.88.0/internal/fanoutconsumer/logs.go:75
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
	go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
github.com/open-telemetry/opentelemetry-collector-contrib/connector/routingconnector.(*logsConnector).ConsumeLogs
	github.com/open-telemetry/opentelemetry-collector-contrib/connector/routingconnector@v0.88.0/logs.go:97
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
	go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/consumerretry.(*logsConsumer).ConsumeLogs
	github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.88.0/consumerretry/logs.go:37
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:124
2023-11-06T11:09:53.719Z	error	consumerretry/logs.go:39	ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping.	{"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/consumerretry.(*logsConsumer).ConsumeLogs
	github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.88.0/consumerretry/logs.go:39
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:124
2023-11-06T11:09:53.720Z	error	adapter/receiver.go:126	ConsumeLogs() failed	{"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:126

Additional context

No response

yanamg7 added the bug (Something isn't working) and needs triage (New item requiring triage) labels Nov 6, 2023

yanamg7 commented Nov 6, 2023

This is the full output log: otel_log.txt


github-actions bot commented Nov 6, 2023

Pinging code owners for connector/routing: @jpkrohling @mwear. See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 (Member) commented:

Hello @yanamg7, from your logs:

2023-11-06T11:09:53.092Z	error	exporterhelper/queue_sender.go:184	Dropping data because sending_queue is full. Try increasing queue_size.	{"kind": "exporter", "data_type": "logs", "name": "otlp/2c27f38b-3399-4531-b66c-f557f646ebb0", "dropped_items": 100}

It looks like the default queue size is 1000 for the otlp exporter, but you've set it to 5 in your configuration. Have you tested using this default?
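
As a rough sketch (reusing the exporter name and endpoint from your config; 1000 is just the documented default, so you may want a smaller value that still fits your memory budget):

exporters:
  otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
    endpoint: 10.144.74.39:4417
    sending_queue:
      enabled: true
      queue_size: 1000   # documented default; was 5 in the reported config
    retry_on_failure:
      enabled: true
      initial_interval: 2s
      max_elapsed_time: 6s
    tls:
      insecure: true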

Also, another error in your logs:

2023-11-06T11:09:53.719Z	error	consumerretry/logs.go:39	ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping.	{"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}

Have you tried this with the filelog receiver's retry_on_failure option enabled?
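
A minimal sketch of what enabling it could look like, based on the filelog receiver's documented retry_on_failure settings (the interval values below are illustrative, not recommendations):

receivers:
  filelog:
    include:
      - /TODA/bench_log/*.log
    # ... existing operators and resource settings unchanged ...
    retry_on_failure:
      enabled: true          # default is false
      initial_interval: 1s   # illustrative; wait before the first retry
      max_interval: 30s      # illustrative; upper bound on the backoff
      max_elapsed_time: 5m   # illustrative; give up after this long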


yanamg7 commented Nov 7, 2023

Hi @crobert-1,
I'm using exactly the same configuration that I used for the setup without the connector. Here it is:

extensions:
    zpages: null
receivers:
    filelog:
        exclude: []
        include:
            - /TODA/log/*.log
        include_file_name: true
        include_file_path: true
        operators:
            - field: attributes.log
              type: add
              value: unspecified
            - field: attributes.ct
              type: add
              value: unspecified
            - field: attributes.scid
              type: add
              value: unspecified
            - field: attributes.sysid
              type: add
              value: unspecified
            - field: attributes.sev
              type: add
              value: unspecified
            - field: attributes.ts
              type: add
              value: 1970-01-01 00:00:00.000000 UTC
            - field: attributes.mid
              type: add
              value: unspecified
            - type: json_parser
            - mapping:
                fatal: emergency
                info4: notice
              parse_from: attributes.sev
              type: severity_parser
            - from: attributes.log
              id: move1
              to: resource["Log Message"]
              type: move
            - from: attributes.ct
              id: move2
              to: resource["Container Name"]
              type: move
            - from: attributes.scid
              id: move3
              to: resource["Unique Container ID"]
              type: move
            - from: attributes.sysid
              id: move4
              to: resource["System ID"]
              type: move
            - from: attributes.sev
              id: move5
              to: resource["Severity"]
              type: move
            - from: attributes.ts
              id: move6
              to: resource["Timestamp"]
              type: move
            - from: attributes.mid
              id: move7
              to: resource["Message ID"]
              type: move
        resource:
            - dataType: bigipn-pods-log
        start_at: beginning
    otlp/mtls:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4517
                tls:
                    cert_file: ${OTEL_TLS_CERT_FILE}
                    client_ca_file: ${OTEL_TLS_CA_FILE}
                    key_file: ${OTEL_TLS_KEY_FILE}
                    min_version: "1.2"
                    reload_interval: 100000
processors:
    batch: null
exporters:
    file:
        path: /var/log/otlp.json
    logging:
        sampling_initial: 0
        sampling_thereafter: 0
        verbosity: basic
    otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
        endpoint: 10.144.74.39:4417
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/ccb8b939-5dab-465f-bcf4-921385d8001d:
        endpoint: 10.144.74.39:4517
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea:
        endpoint: 10.144.74.39:4317
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
service:
    pipelines:
        logs:
            exporters:
                - logging
                - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
                - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
                - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
            processors:
                - batch
            receivers:
                - otlp/mtls
                - filelog
        metrics:
            exporters:
                - logging
                - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
                - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
                - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
            processors:
                - batch
            receivers:
                - otlp/mtls
    extensions:
        - zpages

This works OK, with no errors and no dropped resources. We need the routing connector for our logic; when I try to add it, I run into these problems.
I'll try the recommended default queue size and retry_on_failure for the receivers, but the question remains: why does this configuration work without the routing connector yet drop so many resources with it?


yanamg7 commented Nov 8, 2023

Hi @crobert-1. I can't use the default queue size because the pod's memory is limited; in this case I prefer to lose some resources rather than crash with a memory issue. I still have the same question: with multiple receivers it works OK, but with the routing connector it runs slowly.


crobert-1 commented Nov 8, 2023

Hello @yanamg7, I'm going to defer to the code owners here, as they are better suited to answer this.

I expect that this is simply a result of the queue size being so small. The routing connector may introduce a "small" delay in the data pipeline (checking data values and figuring out where to route them), and this "small" delay may be enough to overload your receiver's queue. This is just a hunch though; it may be something more, which is why I'd prefer to let others who are more informed take a look.


mwear commented Nov 9, 2023

Hi @yanamg7. I suspect that @crobert-1's hunch is probably correct. If you want, you could compare the performance against the routing processor. If you try the processor, you'll have to update your routing rules to route to exporters (instead of pipelines) and prefix your OTTL statements with "resource.". See the config here for an example; a rough sketch follows below. If you do this and find a notable difference, please let us know.
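
For illustration only, a sketch of what the logs routing table could look like with the routing processor, reusing the exporters and match patterns from your connector config (treat it as a starting point to adapt, not a drop-in replacement):

processors:
  routing/logs:
    default_exporters: [otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea]
    error_mode: propagate
    table:
      # statements are evaluated against the resource, hence the resource. prefix
      - statement: route() where IsMatch(resource.attributes["FileName"], ".*access.*log")
        exporters: [otlp/ccb8b939-5dab-465f-bcf4-921385d8001d]
      - statement: route() where IsMatch(resource.attributes["FileName"], ".*sec.*log")
        exporters: [otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea]
      - statement: route() where not IsMatch(resource.attributes["FileName"], ".*sec.*log") and not IsMatch(resource.attributes["FileName"], ".*access.*log")
        exporters: [otlp/2c27f38b-3399-4531-b66c-f557f646ebb0]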

I'm not sure if you have experimented with the memory limiter processor, but it might be another way to help you limit resources while using a larger queue size.
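
A minimal sketch of that combination, sized against the 3Gi pod limit from the issue (the specific numbers are illustrative guesses, not tuned recommendations):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2500        # stay below the 3Gi pod limit (illustrative)
    spike_limit_mib: 500   # headroom for bursts (illustrative)

service:
  pipelines:
    logs/in:
      receivers: [filelog]
      processors: [memory_limiter, batch]   # memory_limiter is typically placed first
      exporters: [routing/logs]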

crobert-1 added the question (Further information is requested) label and removed the bug (Something isn't working) and needs triage (New item requiring triage) labels Nov 9, 2023
jpkrohling (Member) commented:

I'll try to run a similar setup locally, but without access to the metrics, it's hard to tell which component is the bottleneck here.

jpkrohling self-assigned this Nov 27, 2023
github-actions bot commented:

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Jan 29, 2024
github-actions bot commented:

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 29, 2024