
filelog receiver: the last event of every data file was lost #4849

Closed
polarG opened this issue Aug 24, 2021 · 7 comments
Labels: bug (Something isn't working)

polarG commented Aug 24, 2021

Describe the bug
The last event of every data file is lost.
Component with this issue: filelog

Steps to reproduce
Data ingestion:

  • Number of files: 648
  • Total number of events: 20,000,000

Test setup:
otel agent ---> otel agent (forwarder) ----> Splunk

  • otel agent to otel agent (forwarder) via OTLP
  • otel agent (forwarder) to Splunk via Splunk HEC

What did you expect to see?
In Splunk, we should see 20,000,000 events.

What did you see instead?
648 events were lost, and from Splunk we can see that each lost event is the last event of its data file.


What config did you use?
Agent with filelog receiver:
receivers:
  filelog:
    include:
      - /home/ansible/data/otel_mixed_data_20M/generate/Memory.perfmon*
      - /home/ansible/data/otel_mixed_data_20M/generate/CPUTime.perfmon*
      - /home/ansible/data/otel_mixed_data_20M/generate/change.security.628.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/change.security.1102.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/auth_win_Security.4776.windows*
      - /home/ansible/data/otel_mixed_data_20M/generate/network_samplelog_tcp_connection_failure.cisco.asa*
      - /home/ansible/data/otel_mixed_data_20M/generate/web_noncorpsite.websense*
      - /home/ansible/data/otel_mixed_data_20M/generate/instrusion.cisco.asa*
    multiline:
      line_start_pattern: ^\d{2}/\d{2}/\d{2,4}\s+\d{2}:\d{2}:\d{2}
    start_at: beginning
    include_file_name: false

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['0.0.0.0:8888']
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop

processors:
  batch:

  memory_limiter:
    ballast_size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}

exporters:
  otlphttp:
    endpoint: "http://:55681"
    insecure: true
    compression: none/gzip

  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "${SPLUNK_HEC_URL}"
    source: "metrics"
    sourcetype: "otel_hostmetrics"

service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

    metrics:
      receivers:
        - prometheus
        # - hostmetrics
      processors: [memory_limiter, batch]
      exporters: [splunk_hec]
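For reference, the multiline.line_start_pattern above treats any line beginning with a timestamp as the start of a new record. A hypothetical input illustrating a match (not taken from the actual data files):

08/24/21 13:45:07 sample event text
    continuation line belonging to the same record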

Agent forwarder/adapter:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681

  otlp/tls:
    protocols:
      grpc:
        endpoint: 0.0.0.0:55690
        tls_settings:
          cert_file: /home/ansible/tls_cert/server.crt
          key_file: /home/ansible/tls_cert/server.key

processors:
  batch:

exporters:
  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: "${SPLUNK_HEC_URL}"
    source: "otel"
    sourcetype: "otel_otlp"
    max_connections: 1000

service:
  pipelines:
    logs:
      receivers:
        - otlp
        - otlp/tls
      processors: [batch]
      exporters: [splunk_hec]
Additional context

@tigrannajaryan

@polarG polarG added the bug Something isn't working label Aug 24, 2021
@polarG polarG changed the title the last event of every data file was lost filelog receiver: the last event of every data file was lost Aug 24, 2021
@tigrannajaryan (Member)

@djaglowski this sounds like a filelog bug, I suspect it may be losing the first or the last event in a file. Will you be able to look into this?

@djaglowski (Member)

@tigrannajaryan I agree it sounds like a bug.

I'm OOO until Thursday but will look into it then.

@tigrannajaryan (Member)

@djaglowski thanks, I'll assign this to you.

@djaglowski (Member)

Here's what I believe is going on:

  • The filelog receiver consumes one log at a time. Before it can consume a log, it must conclude that it has found a complete log.
  • In the simple (default) case, a newline tells the receiver that the log is complete.
  • When using multiline.line_start_pattern, the receiver uses the specified pattern to determine that the previous log is complete.
  • The last log in a file is (of course) never followed by a new line matching the pattern.
  • The receiver therefore keeps waiting to see whether there is additional content for this final log.

In a sense, this is working as intended, in that we're not "prematurely" consuming a log. However, multiline.line_start_pattern clearly requires an accommodation for the final log in each file.

Fortunately, I believe we have an appropriate mechanism in place.

  • As is, you can set the force_flush_period parameter to a non-zero value of your choice. This sets a maximum time that the receiver will wait before emitting a pending log, and should ensure that all logs are consumed; a config sketch follows below. (Doc update pending. See Add documentation for filelog's new force_flush_period parameter #5066)
  • As a better "built-in" solution, I propose that we make the default value of force_flush_period dynamic, such that it is automatically non-zero when using multiline.line_start_pattern. I will create an issue to discuss this further.
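A minimal sketch of that workaround, reusing the reporter's filelog settings from above (the 500ms value is an illustrative assumption, not a recommendation):

receivers:
  filelog:
    include:
      - /home/ansible/data/otel_mixed_data_20M/generate/Memory.perfmon*
    multiline:
      line_start_pattern: ^\d{2}/\d{2}/\d{2,4}\s+\d{2}:\d{2}:\d{2}
    start_at: beginning
    # Illustrative value: flush a pending multiline log after 500ms with no
    # new input, so the final (unterminated) log in each file is still emitted.
    force_flush_period: 500ms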

@tigrannajaryan (Member)

Thanks for the analysis. Perhaps we can just make force_flush_period non-zero by default, regardless of the multiline setting. I don't see huge downsides, and it keeps things simple. I do not expect any log writer to write part of a log line, wait for a very long time, and then write the rest of it; something around a few hundred ms should be enough for any log writer to finish a complete log record.
Having a maximum waiting period also helps to ensure that the very last log record written is not delayed indefinitely (e.g. if the writer for some reason writes line breaks before new records rather than after them).

@djaglowski (Member)

@tigrannajaryan Good point. I like the simpler version too, and agree there's little downside.

@sumo-drosiek, can I ask your opinion on this, since you had the original use case and implemented the feature?

@sumo-drosiek (Member)

@djaglowski @tigrannajaryan
I agree that most cases should be handled by a relatively low value of force_flush_period. It was set to 0 to keep backward compatibility, but that doesn't make much sense in real life, especially if files are being rotated or nobody is writing to them at all.

In my opinion there is no need to make it dynamic; I would leave it as a simple value.
