
[receiver/filelog] loops forever with large log lines #16487

Closed
alexvanboxel opened this issue Nov 27, 2022 · 15 comments
Labels: bug (Something isn't working), priority:p2 (Medium), receiver/filelog

@alexvanboxel
Contributor

alexvanboxel commented Nov 27, 2022

Component(s)

receiver/filelog

What happened?

Description

The filelog receiver loops forever if a single log line is too large. It happens at the default 1MiB limit, but can be reproduced on smaller lines by setting max_log_size: 17KiB.

Steps to Reproduce

  1. Set max_log_size above the 16KiB minimum, for example 17KiB (see the example config)
  2. Create a log file containing a line at least 32KiB long (a sketch that generates one follows this list)
  3. Make sure the log file is included in the config file (see the example config)
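
A minimal sketch of such a generator, assuming nothing beyond the Go standard library; the file name and line length are only illustrative:

```go
package main

import (
	"os"
	"strings"
)

func main() {
	// One ~32KiB line plus a trailing newline: long enough to exceed a 17KiB max_log_size.
	line := strings.Repeat("x", 32*1024) + "\n"
	if err := os.WriteFile("log_32k.txt", []byte(line), 0o644); err != nil {
		panic(err)
	}
}
```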

Expected Result

Any of the following:

  • truncate the log line, skip to the end of the line, and continue with the rest of the log file
  • truncate the log line, keep reading the rest of the line, emitting one log entry per max_log_size
  • drop the log line

Actual Result

An infinite loop with the error log entry too large

Collector version

v0.66.0 (but the issue is not new)

Environment information

Environment

OS: "macOS 13.0"
Compiler (if manually compiled): "go 1.19.1"

OpenTelemetry Collector configuration

receivers:
  filelog:
    start_at: beginning
    max_log_size: 17KiB # only breaks above 16K buffer minimum
    include:
    - /path/to/file/with/32k_logline/log_32k.txt

exporters:
  logging:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: []
      exporters: [ logging ]

Log output

Forever loop:

2022-11-27T13:58:28.632+0100    error   fileconsumer/reader.go:78       Failed during scan      {"kind": "receiver", "name": "filelog/json", "pipeline": "logs", "component": "fileconsumer", "path": "/Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/dist/log_32k.txt", "error": "log entry too large"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Reader).ReadToEnd
        /Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/reader.go:78
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).consume.func1
        /Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file.go:137
2022-11-27T13:58:29.632+0100    error   fileconsumer/reader.go:78       Failed during scan      {"kind": "receiver", "name": "filelog/json", "pipeline": "logs", "component": "fileconsumer", "path": "/Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/dist/log_32k.txt", "error": "log entry too large"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Reader).ReadToEnd
        /Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/reader.go:78
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer.(*Manager).consume.func1
        /Users/alex.vanboxel/src/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file.go:137

Additional context

start_at: beginning doesn't influence the error loop; it just makes the issue easier to reproduce.

bufio.Scanner is used for scanning the files; a worrying note in its documentation is:

Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token. Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.
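
A minimal, self-contained sketch (not the collector's code) of the behavior described above: once a token exceeds the scanner's buffer, Scan returns false, Err reports bufio.ErrTooLong, and the scanner never advances again, which is why ReadToEnd keeps failing at the same offset:

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	// A single "line" that is larger than the scanner's maximum buffer.
	input := strings.Repeat("x", 64) + "\nnext line\n"
	sc := bufio.NewScanner(strings.NewReader(input))
	sc.Buffer(make([]byte, 16), 32) // cap the token size at 32 bytes

	for sc.Scan() {
		fmt.Println(sc.Text()) // never reached: the first token is already too large
	}
	fmt.Println(errors.Is(sc.Err(), bufio.ErrTooLong)) // true
	fmt.Println(sc.Scan())                             // false: the scanner is stopped for good
}
```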

alexvanboxel added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Nov 27, 2022
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@alexvanboxel
Contributor Author

@djaglowski I want to help on this if needed; for us, this is a blocker for going "GA" with our log collection through the collector. But I'd first like to discuss strategy.

djaglowski added the priority:p2 (Medium) label and removed the needs triage (New item requiring triage) label on Nov 28, 2022
@djaglowski
Member

@alexvanboxel, I have limited capacity this week so will not be able to investigate in detail. If you have time to work out a fix, I will prioritize giving feedback.

@alexvanboxel
Contributor Author

I'm almost afraid that it would mean moving from Scanner to Reader, which means a lot of work. But I'd like a second opinion. We have had this problem for a while, so it can wait; I prefer to talk it over before beginning work.

@djaglowski
Member

I think you are probably right about what it would require to fix.

@maokitty

maokitty commented Dec 16, 2022

I have met the same problem and found a quick way to fix it.
I'd like to share my method first, since I have not contributed before. If this looks good, I will commit a fix with more detail in the coming days.
Here is the core code:
```go
// If the scanner stopped because the entry exceeded its buffer, recover by
// seeking back to the last committed offset and emitting a truncated entry.
if errors.Is(scanner.Err(), bufio.ErrTooLong) {
	if _, err := r.file.Seek(r.Offset, 0); err != nil {
		r.Errorw("Failed to seek", zap.Error(err))
		return
	}
	// Read at most maxLogSize bytes of the oversized entry.
	buf := make([]byte, r.maxLogSize)
	n, err := r.Read(buf)
	if err != nil {
		r.Errorw("Failed to read", zap.Error(err))
	}
	r.Offset += int64(n)
	// Decode only the bytes actually read, then emit the truncated token.
	token, err := r.splitter.Encoding.Decode(buf[:n])
	if err != nil {
		r.Errorw("Failed to decode", zap.Error(err))
	} else {
		r.emit(ctx, r.fileAttributes, token)
	}
}
```

@fatsheep9146
Contributor

> I have met the same problem and found a quick way to fix it.

@djaglowski, what do you think about this fix?

@djaglowski
Member

@fatsheep9146, I think this approach makes sense. @maokitty, thanks for looking into this - I'd be happy to review a PR.

@fatsheep9146
Contributor

@maokitty, your contribution is welcome!

@haoqixu
Member

haoqixu commented Dec 30, 2022

> I'm almost afraid that it would mean moving from Scanner to Reader, which means a lot of work. But I'd like a second opinion. We have had this problem for a while, so it can wait; I prefer to talk it over before beginning work.

I tried to fix it without moving from Scanner to Reader. A SplitFunc can return (0, nil, nil) to signal the Scanner to read more data. The PositionalScanner can check the value returned by the wrapped SplitFunc and modify it to truncate the log entry. The fix is in #17309.
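
A rough sketch of that idea, not the actual code in #17309; truncatingSplit and maxLogSize are illustrative names:

```go
package main

import "bufio"

// truncatingSplit wraps an existing bufio.SplitFunc. When the inner split
// function asks for more data (returns 0, nil, nil) but the buffer already
// holds maxLogSize bytes, the wrapper emits a truncated token and advances
// past it, so the Scanner keeps making progress instead of failing with
// bufio.ErrTooLong.
func truncatingSplit(split bufio.SplitFunc, maxLogSize int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (int, []byte, error) {
		advance, token, err := split(data, atEOF)
		if err == nil && advance == 0 && token == nil && len(data) >= maxLogSize {
			return maxLogSize, data[:maxLogSize], nil
		}
		return advance, token, err
	}
}
```

For this to work, the Scanner's internal buffer has to be allowed to grow to at least maxLogSize (via Scanner.Buffer), so the wrapper actually sees maxLogSize bytes before deciding to truncate.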

@djaglowski
Member

Thanks for both contributions @maokitty and @haoqixu. I am trying to assess which is preferable. My first thought is that #17309 is less complex and therefore may be better.

From a user perspective, is there any difference in the behavior of the two?

@haoqixu
Member

haoqixu commented Dec 30, 2022

#17301 seems not to recover the scanner? It breaks the for loop at the end of the if !ok {} path.

@haoqixu
Member

haoqixu commented Dec 30, 2022

> #17301 seems not to recover the scanner? It breaks the for loop at the end of the if !ok {} path.

The test case TestTokenization/large_log_line in #17301 was modified to call r.ReadToEnd() twice. When I remove the second r.ReadToEnd() and append more lines to the fileContent of that test case, the test fails with a Timed out waiting for token error. It seems the scanner fails to emit any more log entries.

@haoqixu
Member

haoqixu commented Jan 3, 2023

I have created another PR using a different strategy to truncate the log entry.

A scanner with the LineStartPattern may break the start line of the following log entry when truncating a long log entry. For example:

|<------------- log_max_size --------->|
<log_start_line>...log_content...<log_start_line>...log_content...<log_start_line>

Both #17301 and #17309 use the same strategy to truncate the log and emit:

|<------------- log_max_size --------->|
<log_start_line>...log_content...<log_st
art_line>...log_content...

#17358 uses a double-size (log_max_size * 2) buffer to avoid breaking the start line and emits:

|<------------- log_max_size --------->|
<log_start_line>...log_content...
<log_start_line>...log_content...
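
A hypothetical sketch of that idea (not the code in #17358; the function and parameter names are made up): with up to 2 * maxLogSize bytes buffered, the next start-pattern match can be located first, so the oversized entry is truncated to maxLogSize while the scanner still advances exactly to the next entry's start line:

```go
package main

import "regexp"

// truncateAtStartPattern assumes data begins at the start of an oversized
// entry and holds up to 2*maxLogSize buffered bytes. It returns how far to
// advance and the (possibly truncated) token to emit.
func truncateAtStartPattern(data []byte, startPattern *regexp.Regexp, maxLogSize int) (advance int, token []byte) {
	// Search for the start of the *next* entry, skipping the match at index 0.
	loc := startPattern.FindIndex(data[1:])
	if loc == nil {
		// No following entry in sight yet: emit a plain truncated token.
		return maxLogSize, data[:maxLogSize]
	}
	next := loc[0] + 1 // offset of the next entry's start line within data
	if next <= maxLogSize {
		// The current entry fits within the limit after all.
		return next, data[:next]
	}
	// Truncate the token to maxLogSize, but advance all the way to the next
	// start line so it is never split across two emitted entries.
	return next, data[:maxLogSize]
}
```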

From a user perspective, the output of #17358 may be better? 🤔

@djaglowski
Member

Thank you @haoqixu and @maokitty! Resolved by #17358

gantrior added a commit to solarwinds/swi-k8s-opentelemetry-collector that referenced this issue Jan 24, 2023
This brings the following fixes related to the filelog receiver:
* filelogreceiver: Truncate log entry if it is longer than max_log_size ([#16487](open-telemetry/opentelemetry-collector-contrib#16487))