
splunk_hec exporter sending events larger than default HEC maxEventSize #696

Closed
matthewmodestino opened this issue Mar 16, 2023 · 14 comments
Labels: bug (Something isn't working), Splunk Platform (Issue related to Splunk Platform destination)

Comments

@matthewmodestino

matthewmodestino commented Mar 16, 2023

Hey Team,

While working with multiple cloud customers, I am seeing the OTel collector frequently encounter 400 Bad Request exporter errors:

2023-03-15T19:07:28.615Z   error	exporterhelper/queued_retry.go:394  Exporting failed. The error is not retryable. Dropping data.   {"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: \"HTTP/1.1 400 Bad Request\\r\\nContent-Length: 65\\r\\nAlt-Svc: h3=\\\":443\\\"; ma=2592000,h3-29=\\\":443\\\"; ma=2592000\\r\\nContent-Type: application/json; charset=UTF-8\\r\\nDate: Wed, 15 Mar 2023 19:07:28 GMT\\r\\nServer: Splunkd\\r\\nVary: Authorization\\r\\nVia: 1.1 xxxxx\\r\\nX-Content-Type-Options: nosniff\\r\\nX-Frame-Options: SAMEORIGIN\\r\\n\\r\\n{\\\"text\\\":\\\"Invalid data format\\\",\\\"code\\\":6,\\\"invalid-event-number\\\":75}\"", "dropped_items": 966}

When checking on the Splunk side, we can see this is due to an event in the batch exceeding the default event size limit of 5MB:

03-16-2023 02:45:48.988 +0000 ERROR HttpInputDataHandler [16637 HttpDedicatedIoThread-0] - Failed processing http input, token name=exampleCluster, channel=n/a, source_IP=x.x.x.x, reply=6, events_processed=0, http_input_body_size=7600767, parsing_err="While expecting event's raw text: String value too long. valueSize=5248313, maxValueSize=5242880, totalRequestSize=7600767"

inputs.conf.spec

maxEventSize = <positive integer>[KB|MB|GB]
* The maximum size of a single HEC (HTTP Event Collector) event.
* HEC disregards and triggers a parsing error for events whose size is
  greater than 'maxEventSize'.
* Default: 5MB

https://docs.splunk.com/Documentation/SplunkCloud/latest/Data/TroubleshootHTTPEventCollector

The HEC exporter is supposed to have a limit of 2MiB by default, so I'm wondering if we have an edge case where the limit is not respected. My initial hunch is that it may be due to our default recombine operator, as k8s container engines split anything over 8192 bytes in containerd or 16384 bytes in Docker...

max_content_length_logs (default: 2097152): Maximum log payload size in bytes. Log batches of bigger size will be broken down into several requests. Default value is 2097152 bytes (2 MiB). Maximum allowed value is 838860800 (~ 800 MB). Keep in mind that Splunk Observability backend doesn't accept requests bigger than 2 MiB. This configuration value can be raised only if used with Splunk Core/Cloud. When set to 0, it will treat as infinite length and it will create only 1 request per batch.
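For reference, this is roughly how the exporter is configured on our side (a sketch; the endpoint and token below are placeholders, not the customer's actual values):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    # 2 MiB default, shown explicitly; batches above this should be split into multiple requests
    max_content_length_logs: 2097152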

Customers will be filing tickets internally, so please have a look and advise.

@VihasMakwana
Contributor

@matthewmodestino this scenario is reproducible if the length of one log message is greater than Splunk's limit.
The current implementation breaks multiple log messages down into batches, but it doesn't break down a single log message.
It will try to send it as one complete entity.

An example to make this clearer:

LogA
LogB
LogC
LogD

The current implementation will group {LogA, LogB} and {LogC, LogD}, but it can't break LogA down into multiple batches.

@matthewmodestino
Author

matthewmodestino commented Mar 17, 2023

Ah, OK! I will test this locally. It does seem to be the default recombine rules combining partial logs into one large log over the limit? Kubernetes container engines break the streams down into smaller chunks, so it must be the recombine putting them back together.

- combine_field: attributes.log
  combine_with: ""
  id: crio-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

- combine_field: attributes.log
  combine_with: ""
  id: containerd-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

- combine_field: attributes.log
  combine_with: ""
  id: docker-recombine
  is_last_entry: attributes.log endsWith "\n"
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

I am curious as to why the HEC exporter's max_content_length_logs (2MiB) doesn't get enforced... I assume its logic is not looking at the recombined attributes.log or something?
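If the recombine operator supports a max_log_size cap (I haven't verified this against the operator version in our chart, so treat this as a sketch only), something like the following could keep recombined entries bounded below Splunk's limit:

- combine_field: attributes.log
  combine_with: ""
  id: containerd-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine
  # assumed setting: flush the combined entry once it reaches this many bytes
  max_log_size: 2097152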

@matthewmodestino
Author

matthewmodestino commented Mar 21, 2023

I have successfully replicated this issue by enabling compression on the exporter and generating sample events over 5MB.

When compression is disabled, the exporter does not allow the batch to be sent, due to the max_content_length check:

2023-03-21T15:02:10.913Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: dropped log event error: event size 5248947 bytes larger than configured max content length 2097152 bytes", "dropped_items": 1}

However, when compression is enabled, we no longer hit the max content length check, and the events are sent to Splunk, where they are rejected:

2023-03-21T14:47:17.526Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: \"HTTP/1.1 400 Bad Request\\r\\nContent-Length: 64\\r\\nConnection: Keep-Alive\\r\\nContent-Type: application/json; charset=UTF-8\\r\\nDate: Tue, 21 Mar 2023 14:47:17 GMT\\r\\nServer: Splunkd\\r\\nVary: Authorization\\r\\nX-Content-Type-Options: nosniff\\r\\nX-Frame-Options: SAMEORIGIN\\r\\n\\r\\n{\\\"text\\\":\\\"Invalid data format\\\",\\\"code\\\":6,\\\"invalid-event-number\\\":4}\"", "dropped_items": 5}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:394
go.opentelemetry.io/collector/exporter/exporterhelper.(*logsExporterWithObservability).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/logs.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:61

This will likely result in us needing some logic to either truncate events in the processing pipeline, or extract what users are interested in and drop the rest of the payload. I will work with some customers to see what the events are and whether they are even necessary.

@VihasMakwana
Contributor

@matthewmodestino is maxEventSize applied before or after decompression of the sent event?

@VihasMakwana
Contributor

So here's a thing.
I wrote a script that sends compressed data to Splunk, but Splunk still rejects it.

Uncompressed size: 7MB, above Splunk limit
Compressed size: ~7KB

I think the maxEventSize setting is applied after decompression in Splunk.

Script:

import gzip
import json

import requests

# Placeholder HEC token and endpoint; replace with real values.
headers = {
    "Authorization": "Splunk 00000000-0000-0000-0000-000000000000",
    "Content-Encoding": "gzip",
    "Content-Type": "application/json",
}

long_string = "r" * (1024 * 1024 * 7)  # ~7MiB of raw event data, well above Splunk's 5MB maxEventSize
body = json.dumps({"event": long_string})
data = gzip.compress(bytes(body, encoding="utf-8"))  # compresses down to a few KB
print("Compressed data size: ", len(data), "bytes")
print("Uncompressed data size: ", len(body), "bytes")

rs = requests.post("http://x.x.x.x:8088/services/collector", data=data, headers=headers)
print(rs.text)

cc: @atoulme @dmitryax

@matthewmodestino
Author

matthewmodestino commented Mar 24, 2023

Yes, Splunk will receive, decompress, then parse/index. The limit is based on the raw event size in the batch, post-decompress.

One thing I noticed is that OTel has a truncate processor function. One thing we could do is truncate events at the max_content_length. That way we at least send most of the event to Splunk instead of dropping it on the OTel side.
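For example, assuming the truncate function meant here is the transform processor's truncate_all (a sketch only; the processor name and limit are illustrative):

processors:
  transform/truncate:
    log_statements:
      - context: log
        statements:
          # illustrative ~1MiB cap on string attribute values such as attributes.log
          - truncate_all(attributes, 1048576)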

@VihasMakwana
Contributor

Well, on the splunk_hec exporter side, we can't do much.

max_content_length_logs is applied to the data as it is sent, regardless of compression.
If compression is enabled, max_content_length_logs applies to the compressed data.
If compression is disabled, it applies to the uncompressed data.

@VihasMakwana
Contributor

We can ask the client to set disable_compression to true.
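For example (a sketch; this is the config key as I understand it, and the endpoint and token are placeholders):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    # with compression off, max_content_length_logs is checked against the raw payload,
    # so oversized events are dropped at the collector instead of being rejected by Splunk
    disable_compression: true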

@matthewmodestino
Author

matthewmodestino commented Mar 28, 2023

Simply removing compression is not an acceptable resolution. I am working with some high-volume customers that rely on compression to reduce the impact on the network. We will need to add logic, in batching or elsewhere, to protect against 5MB+ payloads.

On-prem users can raise the limit on their HFs as a workaround, but customers sending directly to Cloud generally can't.

@VihasMakwana
Contributor

I understand; we can ask them to increase the maxEventSize limit.
It's 800MiB max, I guess.

@matthewmodestino
Author

We can do that on-prem but not in cloud. We will need to find a resolution to this in the collector logic.

@VihasMakwana
Contributor

@matthewmodestino, what I'm thinking:

1. Introduce a new config variable, max_payload_size, to protect against huge payloads.
2. Apply it just before sending to Splunk, so we can set the limit as we see fit; it could be 5MB by default. A rough config sketch is below.
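Roughly (a sketch of the proposal only; the option is not implemented, and the name, default, endpoint, and token are illustrative):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    max_content_length_logs: 2097152
    # proposed option: reject any single event payload larger than this,
    # defaulting to Splunk's 5MB maxEventSize
    max_payload_size: 5242880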

@VihasMakwana
Contributor

@atoulme can we close this one?

@atoulme
Contributor

atoulme commented Aug 2, 2023

Closing. Thanks!

@atoulme atoulme closed this as completed Aug 2, 2023