
splunk_hec exporter sending events larger than default HEC maxEventSize #696

Closed
matthewmodestino opened this issue Mar 16, 2023 · 14 comments
Labels: bug (Something isn't working), Splunk Platform (Issue related to Splunk Platform destination)

Comments

@matthewmodestino

matthewmodestino commented Mar 16, 2023

Hey Team,

While working with multiple cloud customers, I am seeing the OTel collector frequently encounter 400 Bad Request exporter errors:

2023-03-15T19:07:28.615Z   error	exporterhelper/queued_retry.go:394  Exporting failed. The error is not retryable. Dropping data.   {"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: \"HTTP/1.1 400 Bad Request\\r\\nContent-Length: 65\\r\\nAlt-Svc: h3=\\\":443\\\"; ma=2592000,h3-29=\\\":443\\\"; ma=2592000\\r\\nContent-Type: application/json; charset=UTF-8\\r\\nDate: Wed, 15 Mar 2023 19:07:28 GMT\\r\\nServer: Splunkd\\r\\nVary: Authorization\\r\\nVia: 1.1 xxxxx\\r\\nX-Content-Type-Options: nosniff\\r\\nX-Frame-Options: SAMEORIGIN\\r\\n\\r\\n{\\\"text\\\":\\\"Invalid data format\\\",\\\"code\\\":6,\\\"invalid-event-number\\\":75}\"", "dropped_items": 966}

When checking on the Splunk side, we can see this is due to an event in the batch exceeding the default event size limit of 5MB:

03-16-2023 02:45:48.988 +0000 ERROR HttpInputDataHandler [16637 HttpDedicatedIoThread-0] - Failed processing http input, token name=exampleCluster, channel=n/a, source_IP=x.x.x.x, reply=6, events_processed=0, http_input_body_size=7600767, parsing_err="While expecting event's raw text: String value too long. valueSize=5248313, maxValueSize=5242880, totalRequestSize=7600767"

inputs.conf.spec

maxEventSize = <positive integer>[KB|MB|GB]
* The maximum size of a single HEC (HTTP Event Collector) event.
* HEC disregards and triggers a parsing error for events whose size is
  greater than 'maxEventSize'.
* Default: 5MB

https://docs.splunk.com/Documentation/SplunkCloud/latest/Data/TroubleshootHTTPEventCollector

The HEC exporter is supposed to have a limit of 2MiB by default, so I'm wondering if we have an edge case where the limit is not respected. My initial hunch is that it may be due to our default recombine operator, as k8s container engines split anything over 8192 bytes in containerd or 16384 bytes in Docker...

max_content_length_logs (default: 2097152): Maximum log payload size in bytes. Log batches of bigger size will be broken down into several requests. Default value is 2097152 bytes (2 MiB). Maximum allowed value is 838860800 (~ 800 MB). Keep in mind that Splunk Observability backend doesn't accept requests bigger than 2 MiB. This configuration value can be raised only if used with Splunk Core/Cloud. When set to 0, it will treat as infinite length and it will create only 1 request per batch.
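For reference, this is roughly how the exporter is configured on our side (a sketch; the endpoint and token below are placeholders, not the customer's actual values):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    # 2 MiB default, shown explicitly; batches above this should be split into multiple requests
    max_content_length_logs: 2097152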

Customers will be filing tickets internally, so please have a look and advise.

@VihasMakwana
Contributor

@matthewmodestino this scenario is reproducible if the length of one log message is greater than Splunk's limit.
The current implementation breaks multiple log messages down into batches, but it doesn't break down a single log message.
It will try to send it as one complete entity.

An example to make this clearer:

LogA
LogB
LogC
LogD

The current implementation will group {LogA, LogB} and {LogC, LogD}, but it can't break LogA down into multiple batches.

@matthewmodestino
Author

matthewmodestino commented Mar 17, 2023

Ah, OK! I will test this locally. It does seem to be the default recombine rules combining partial logs into one large log over the limit? Kubernetes container engines break the streams down into smaller chunks, so it must be the recombine putting them back together.

- combine_field: attributes.log
  combine_with: ""
  id: crio-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

- combine_field: attributes.log
  combine_with: ""
  id: containerd-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

- combine_field: attributes.log
  combine_with: ""
  id: docker-recombine
  is_last_entry: attributes.log endsWith "\n"
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine

I am curious as to why the HEC exporter's max_content_length_logs (2MiB) doesn't get enforced... I assume its logic is not looking at the recombined attributes.log or something?
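If the recombine operator supports a max_log_size cap (I haven't verified this against the operator version in our chart, so treat this as a sketch only), something like the following could keep recombined entries bounded below Splunk's limit:

- combine_field: attributes.log
  combine_with: ""
  id: containerd-recombine
  is_last_entry: attributes.logtag == 'F'
  output: handle_empty_log
  source_identifier: attributes["log.file.path"]
  type: recombine
  # assumed setting: flush the combined entry once it reaches this many bytes
  max_log_size: 2097152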

@matthewmodestino
Author

matthewmodestino commented Mar 21, 2023

I have successfully replicated this issue by enabling compression on the exporter and generating sample events over 5MB.

When compression is disabled, the exporter does not allow the batch to be sent, due to the max_content_length check:

2023-03-21T15:02:10.913Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: dropped log event error: event size 5248947 bytes larger than configured max content length 2097152 bytes", "dropped_items": 1}

However, when compression is enabled, we no longer hit the max content length check, and the events are sent to Splunk, where they are rejected:

2023-03-21T14:47:17.526Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "logs", "name": "splunk_hec/platform_logs", "error": "Permanent error: \"HTTP/1.1 400 Bad Request\\r\\nContent-Length: 64\\r\\nConnection: Keep-Alive\\r\\nContent-Type: application/json; charset=UTF-8\\r\\nDate: Tue, 21 Mar 2023 14:47:17 GMT\\r\\nServer: Splunkd\\r\\nVary: Authorization\\r\\nX-Content-Type-Options: nosniff\\r\\nX-Frame-Options: SAMEORIGIN\\r\\n\\r\\n{\\\"text\\\":\\\"Invalid data format\\\",\\\"code\\\":6,\\\"invalid-event-number\\\":4}\"", "dropped_items": 5}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:394
go.opentelemetry.io/collector/exporter/exporterhelper.(*logsExporterWithObservability).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/logs.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:61

This will likely result in us needing some logic to either truncate events in the processing pipeline, or extract what users are interested in and drop the rest of the payload. I will work with some customers to see what the events are and whether they are even necessary.

@VihasMakwana
Contributor

@matthewmodestino is maxEventSize applied before or after decompression of the sent event?

@VihasMakwana
Contributor

So here's a thing.
I wrote a script that sends compressed data to Splunk, but Splunk still rejects it.

Uncompressed size: 7MB, above Splunk limit
Compressed size: ~7KB

I think the maxEventSize setting is applied after decompression in Splunk.

Script:

import gzip
import json

import requests

# Placeholder HEC token and endpoint; replace with real values.
headers = {
    "Authorization": "Splunk 00000000-0000-0000-0000-000000000000",
    "Content-Encoding": "gzip",
    "Content-Type": "application/json",
}

long_string = "r" * (1024 * 1024 * 7)  # ~7MiB of raw event data, well above Splunk's 5MB maxEventSize
body = json.dumps({"event": long_string})
data = gzip.compress(bytes(body, encoding="utf-8"))  # compresses down to a few KB
print("Compressed data size: ", len(data), "bytes")
print("Uncompressed data size: ", len(body), "bytes")

rs = requests.post("http://x.x.x.x:8088/services/collector", data=data, headers=headers)
print(rs.text)

cc: @atoulme @dmitryax

@matthewmodestino
Author

matthewmodestino commented Mar 24, 2023

Yes, Splunk will receive, decompress, then parse/index. The limit is based on the raw event size in the batch, post-decompress.

One thing I noticed is that OTel has a truncate processor function. One thing we could do is truncate events at the max_content_length. That way we at least send most of the event to Splunk instead of dropping it on the OTel side.
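For example, assuming the truncate function meant here is the transform processor's truncate_all (a sketch only; the processor name and limit are illustrative):

processors:
  transform/truncate:
    log_statements:
      - context: log
        statements:
          # illustrative ~1MiB cap on string attribute values such as attributes.log
          - truncate_all(attributes, 1048576)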

@VihasMakwana
Contributor

Well, on the splunk_hec exporter side, we can't do much.

max_content_length_logs is applied to the data as it is sent, regardless of compression.
If compression is enabled, max_content_length_logs applies to the compressed data.
If compression is disabled, it applies to the uncompressed data.

@VihasMakwana
Contributor

We can ask the client to set disable_compression to true.
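For example (a sketch; this is the config key as I understand it, and the endpoint and token are placeholders):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    # with compression off, max_content_length_logs is checked against the raw payload,
    # so oversized events are dropped at the collector instead of being rejected by Splunk
    disable_compression: true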

@matthewmodestino
Author

matthewmodestino commented Mar 28, 2023

Simply removing compression is not an acceptable resolution. I am working with some high-volume customers that rely on compression to reduce the impact on the network. We will need to add logic, in batching or elsewhere, to protect against 5MB+ payloads.

On-prem users can raise the limit on their HFs as a workaround, but customers sending directly to Cloud generally can't.

@VihasMakwana
Contributor

I understand; we can ask them to increase the maxEventSize limit.
It's 800MiB max, I guess.

@matthewmodestino
Author

We can do that on-prem but not in cloud. We will need to find a resolution to this in the collector logic.

@VihasMakwana
Contributor

@matthewmodestino, what I'm thinking:

1. Introduce a new config variable, max_payload_size, to protect against huge payloads.
2. Apply it just before sending to Splunk, so we can set the limit as we see fit; it could be 5MB by default. A rough config sketch is below.
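Roughly (a sketch of the proposal only; the option is not implemented, and the name, default, endpoint, and token are illustrative):

exporters:
  splunk_hec/platform_logs:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "https://http-inputs-example.splunkcloud.com:443/services/collector"
    max_content_length_logs: 2097152
    # proposed option: reject any single event payload larger than this,
    # defaulting to Splunk's 5MB maxEventSize
    max_payload_size: 5242880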

@VihasMakwana
Contributor

@atoulme can we close this one?

@atoulme
Contributor

atoulme commented Aug 2, 2023

Closing. Thanks!

@atoulme atoulme closed this as completed Aug 2, 2023