
filelog receiver: the last event of every data file was lost #4849

Closed
polarG opened this issue Aug 24, 2021 · 7 comments
Labels: bug (Something isn't working)

polarG commented Aug 24, 2021

Describe the bug
The last event of every data file is lost.
Component with this issue: filelog

Steps to reproduce
Data ingestion:

  • Number of files: 648
  • Total number of events: 20,000,000

Test setup:
otel agent ---> otel agent (forwarder) ----> Splunk

  • otel agent to otel agent (forwarder) via OTLP
  • otel agent (forwarder) to Splunk via Splunk HEC

What did you expect to see?
In Splunk, we should see 20,000,000 events.

What did you see instead?
648 events were lost, and from Splunk we can see that each lost event is the last event of its data file.


What config did you use?
Agent with filelog receiver:
receivers:
  filelog:
    include:
      - /home/ansible/data/otel_mixed_data_20M/generate/Memory.perfmon*
      - /home/ansible/data/otel_mixed_data_20M/generate/CPUTime.perfmon*
      - /home/ansible/data/otel_mixed_data_20M/generate/change.security.628.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/change.security.1102.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/auth_win_Security.4776.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/network_samplelog_tcp_connection_failure.cisco.asa*
      - /home/ansible/data/otel_mixed_data_20M/generate/web_noncorpsite.websense*
      - /home/ansible/data/otel_mixed_data_20M/generate/instrusion.cisco.asa*
    multiline:
      line_start_pattern: ^\d{2}/\d{2}/\d{2,4}\s+\d{2}:\d{2}:\d{2}
    start_at: beginning
    include_file_name: false

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['0.0.0.0:8888']
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop

processors:
  batch:

  memory_limiter:
    ballast_size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}

exporters:
  otlphttp:
    endpoint: "http://:55681"
    insecure: true
    compression: none/gzip

  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "${SPLUNK_HEC_URL}"
    source: "metrics"
    sourcetype: "otel_hostmetrics"

service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

    metrics:
      receivers:
        - prometheus
        # - hostmetrics
      processors: [memory_limiter, batch]
      exporters: [splunk_hec]
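For reference, the multiline.line_start_pattern above treats any line beginning with a timestamp as the start of a new record. A hypothetical input illustrating a match (not taken from the actual data files):

08/24/21 13:45:07 sample event text
    continuation line belonging to the same record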

Agent forwarder/adapter:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681

  otlp/tls:
    protocols:
      grpc:
        endpoint: 0.0.0.0:55690
        tls_settings:
          cert_file: /home/ansible/tls_cert/server.crt
          key_file: /home/ansible/tls_cert/server.key

processors:
  batch:

exporters:
  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "${SPLUNK_HEC_URL}"
    source: "otel"
    sourcetype: "otel_otlp"
    max_connections: 1000

service:
  pipelines:
    logs:
      receivers:
        - otlp
        - otlp/tls
      processors: [batch]
      exporters: [splunk_hec]
Additional context

@tigrannajaryan

@polarG polarG added the bug Something isn't working label Aug 24, 2021
@polarG polarG changed the title the last event of every data file was lost filelog receiver: the last event of every data file was lost Aug 24, 2021
@tigrannajaryan (Member)

@djaglowski this sounds like a filelog bug, I suspect it may be losing the first or the last event in a file. Will you be able to look into this?

@djaglowski (Member)

@tigrannajaryan I agree it sounds like a bug.

I'm OOO until Thursday but will look into it then.

@tigrannajaryan (Member)

@djaglowski thanks, I'll assign this to you.

@djaglowski (Member)

Here's what I believe is going on:

  • The filelog receiver consumes one log at a time. Before it can consume a log, it must conclude that it has found a complete log.
  • In the simple (default) case, a newline tells the receiver that the log is complete.
  • When using multiline.line_start_pattern, the receiver uses the specified pattern to determine that the previous log is complete.
  • The last log in a file is (of course) never followed by a new line matching the pattern.
  • The receiver therefore keeps waiting to see whether there is additional content for this final log.

In a sense, this is working as intended, in that we're not "prematurely" consuming a log. However, multiline.line_start_pattern clearly requires an accommodation for the final log in each file.

Fortunately, I believe we have an appropriate mechanism in place.

  • As is, you can set the force_flush_period parameter to a non-zero value of your choice. This sets a maximum time that the receiver will wait before emitting a pending log, and should ensure that all logs are consumed; a config sketch follows below. (Doc update pending. See Add documentation for filelog's new force_flush_period parameter #5066)
  • As a better "built-in" solution, I propose that we make the default value of force_flush_period dynamic, such that it is automatically non-zero when using multiline.line_start_pattern. I will create an issue to discuss this further.
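A minimal sketch of that workaround, reusing the reporter's filelog settings from above (the 500ms value is an illustrative assumption, not a recommendation):

receivers:
  filelog:
    include:
      - /home/ansible/data/otel_mixed_data_20M/generate/Memory.perfmon*
    multiline:
      line_start_pattern: ^\d{2}/\d{2}/\d{2,4}\s+\d{2}:\d{2}:\d{2}
    start_at: beginning
    # Illustrative value: flush a pending multiline log after 500ms with no
    # new input, so the final (unterminated) log in each file is still emitted.
    force_flush_period: 500ms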

@tigrannajaryan (Member)

Thanks for the analysis. Perhaps we can just make force_flush_period non-zero by default, regardless of the multiline setting. I don't see huge downsides, and it keeps things simple. I do not expect any log writer to write part of a log line, wait for a very long time, and then write the rest of it; something around a few hundred ms should be enough for any log writer to finish a complete log record.
Having a maximum waiting period also helps to ensure that the very last log record written is not delayed indefinitely (e.g. if the writer for some reason writes line breaks before new records rather than after them).

@djaglowski (Member)

@tigrannajaryan Good point. I like the simpler version too, and agree there's little downside.

@sumo-drosiek, can I ask your opinion on this, since you had the original use case and implemented the feature?

@sumo-drosiek (Member)

@djaglowski @tigrannajaryan
I agree that most cases should be handled by a relatively low value of force_flush_period. It was set to 0 to keep backward compatibility, but that doesn't make much sense in real life, especially if files are being rotated or nobody is writing to them at all.

In my opinion there is no need to make it dynamic; I would leave it as a simple value.
