file_storage doesn't compact after some PVs are full #26256
Comments
Can we add more compaction conditions, such as a timeout? Will compaction delete the whole file, or only compact part of it?
I'm not sure I can explain it more accurately than the documentation but here's the simplified mental model that I use to reason about this. We basically need to look at the file as a reservoir for data. If we mostly fill up the reservoir, then it will automatically expand. However, if we drain the reservoir, it will not automatically shrink. We can use compaction to shrink the reservoir, but it must remain at least as large as its contents. So typically we need to wait until contents of the reservoir have been substantially drained. Then we can actually shrink it by a meaningful amount. This is why we must meet two separate conditions before we run compaction.
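The reservoir model above can be sketched as a small toy class. This is only an illustration of the two-condition idea, not the extension's code: the class, its field names, and the inclusive comparisons are assumptions, with `needed_threshold` and `trigger_threshold` standing in for `rebound_needed_threshold_mib` and `rebound_trigger_threshold_mib`.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class Reservoir:
    """Toy model of a storage file: allocated size vs. live contents."""

    needed_threshold: int   # hypothetical stand-in for rebound_needed_threshold_mib
    trigger_threshold: int  # hypothetical stand-in for rebound_trigger_threshold_mib
    allocated: int = 0      # size of the file on disk (the reservoir)
    used: int = 0           # bytes of live data inside it (the contents)
    rebound_needed: bool = False

    def write(self, n: int) -> None:
        self.used += n
        # The reservoir expands when the contents need the room.
        self.allocated = max(self.allocated, self.used)
        # Condition 1: the file has grown large enough to be worth shrinking.
        if self.allocated >= self.needed_threshold:
            self.rebound_needed = True

    def drain(self, n: int) -> None:
        # Draining reduces the contents but never the allocated size.
        self.used = max(0, self.used - n)

    def maybe_compact(self) -> bool:
        # Condition 2: the contents have dropped low enough to reclaim space.
        if self.rebound_needed and self.used <= self.trigger_threshold:
            self.allocated = self.used
            self.rebound_needed = False
            return True
        return False
```

In this model, a file that grew to 150 MiB and still holds 50 MiB of data is not compacted with a 10 MiB trigger threshold, even though two thirds of it is free space.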
Your graph showing used bytes may only be showing the size of the reservoir. Typically when compaction does not occur, it is because the contents are not draining as quickly as you expect.
The problem is that a timeout does not take into account whether or not the contents of the reservoir have been drained. There may be some value here because you could in theory reclaim some space, but it may often be a lot less than you'd expect, and you'll have spent a non-trivial amount of compute for it.
Compaction will create a complete copy of the file, edit the copy, and then finally overwrite the old file with the new.
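A minimal sketch of the copy-then-overwrite pattern, assuming a hypothetical line-oriented file in which each line is one record. The real storage file is not line-oriented, and this sketch filters records while copying rather than editing a finished copy; the `compact` function and the `is_live` callback are invented for illustration.

```python
import os
import tempfile


def compact(path, is_live):
    """Rewrite `path`, keeping only the lines for which is_live(line) is true."""
    directory = os.path.dirname(os.path.abspath(path))
    # Build the copy next to the original so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            for line in src:
                if is_live(line):
                    tmp.write(line)
        # Swap the compacted copy into place over the old file.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, leave the original untouched and discard the copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Any copy-based scheme like this needs room for the copy while both files exist, so free space on the target volume matters when compaction runs.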
@djaglowski Thanks for your explanation. On the question of whether the reservoir has drained: neither the logs nor any other evidence tells us whether the compaction conditions have been reached, so we are confused and cannot find a pattern. I have learned to infer whether data has drained by calculating the metric difference (otelcol_receiver_accepted_log_records - otelcol_exporter_sent_log_records). With that, we found that even when all data has drained, the file does not compact. In our setup, filestorage backs the exporter's persistent queue. So my concerns about filestorage being full are:
1. Will it impact the path from the sending queue to the persistent queue? As the behavior shows, some of the pods can still accept and send data.
2. Even if it can accept and send data, how is its performance affected? It can no longer use the file (10GB) as a buffer, right? Data can then only be buffered in memory (max 2GB*75%).
3. Could we add more compaction conditions, e.g. sending_queue size? The existing ones are too low-level for us and not clear. A sending_queue size of 0 means the data has already been sent, as does the metric (accepted - sent) I mentioned.
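The accepted-minus-sent estimate described above can be written as a small helper. This is a sketch under assumptions: the function names are invented, the inputs are plain integers sampled from the two counters, and it ignores records that were dropped or refused, as well as counter resets after a pod restart.

```python
def estimated_backlog(accepted, sent):
    """Records accepted by the receiver but not yet sent by the exporter."""
    # Clamp at zero: samples taken at slightly different times can briefly
    # show more sent than accepted.
    return max(0, accepted - sent)


def looks_drained(samples, tolerance=0):
    """True if every (accepted, sent) sample shows a backlog within tolerance."""
    return all(estimated_backlog(a, s) <= tolerance for a, s in samples)
```

A backlog estimate of zero only says that records left the collector; it does not by itself show how much live data remains in the storage file.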
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity.
I have not had time to look into this further. If anyone is able to provide a unit test that demonstrates the failure, this would make solving it much easier.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
exporter/sumologic, extension/storage/filestorage
What happened?
Description
We noticed that more and more PVs used by the otel collector's exporter/sumo/file_storage are getting full (10GB) and do not compact. Over time, fewer and fewer pods can provide service because the sending_queue is full. Even recreating the pod does not trigger compaction.
Steps to Reproduce
Our guess: a large volume of logs/entities is flushed to the same collectors, whose exporter uses file_storage, and this lasts for 2 hours.
Architecture
A daemonset of otel collectors responsible for collecting docker logs -> a statefulset of otel collectors that buffers data before sending -> storage backend.
Expected Result
a. The receiver keeps receiving data -> file_storage -> mmap()? -> sending_queue (memory)
b. Do both rebound_needed_threshold_mib and rebound_trigger_threshold_mib track the mmap()?
c. Is it possible that file_storage is stuck in sending_queue?
Workflow:
Daemonset pod exporter -> (k8s) service -> receiver of Statefulset pod
Actual Result
Only some of the pods auto-compact the file_storage, as the chart shows.
Those with a full PV cannot compact and appear to stop working, even after the pod is recreated.
Collector version
0.74.0
Environment information
Environment
Kubernetes 1.23.9
OpenTelemetry Collector configuration
statefulset otel configurations
Additional context
No response