Problem
segment.ms and segment.bytes determine when we roll a segment and make it eligible for tiering. Their defaults are 7 days and 1 GiB respectively.
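(For reference, a minimal sketch of reading these configs back with Kafka's AdminClient - the bootstrap server and topic name below are placeholders; with stock broker settings they resolve to the 1 GiB / 7 day defaults above.)

```java
// Minimal sketch: inspect the roll-related configs for one topic via the AdminClient.
// "localhost:9092" and "my-tiered-topic" are placeholders. With stock broker settings,
// segment.bytes resolves to 1073741824 (1 GiB) and segment.ms to 604800000 ms (7 days).
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class ShowRollConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic");
            Map<ConfigResource, Config> configs =
                admin.describeConfigs(Set.of(topic)).all().get();
            for (String name : List.of("segment.bytes", "segment.ms", "local.retention.ms")) {
                System.out.println(name + " = " + configs.get(topic).get(name).value());
            }
        }
    }
}
```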
As far as I can tell, the reasoning for not having too small segment files is this:
open file descriptors. Each segment file is an open file descriptor, and Kafka has been known to run out of file descriptors. File descriptors are also theorized to take up meaningful resources/memory, but I don't think that's true.
write speed. Having to roll a segment may cause a slight hiccup in latency, since it's more work than just writing to an already-open file.
I don't think any of these reasons are particularly strong. I'm not sure I see a ton of drawbacks in having many segment files, but I may be missing some nuance.
Regardless: with tiered storage, we don't keep many segment files locally - it's all in S3.
But waiting 7 days for something to end up in S3 seems way too long to me.
Waiting for 1 GiB per partition can also result in a lot of disk usage.
❌ In deployments with lots of partitions (e.g. 5k per broker), you can end up hogging 4.8 TiB of local storage (5,000 partitions × up to 1 GiB each in the active segment) without a good reason.
For tiered storage to move the needle, you need to make it affordable to deploy small & fast SSDs, and reduce the amount of data you need to move when reassigning (plus other stuff).
Solution
configure segment.bytes to something lower - e.g. 50-100 MiB? (see the sketch after this list)
the only case where I imagine this can be disruptive is high-throughput partitions. If a partition is taking in 5 MiB/s, you wouldn't want a 50 MiB segment.bytes to make it roll a segment every 10 seconds. Having a dynamic segment.bytes would be a pretty cool feature.
configure segment.ms to something lower - e.g. 12-24 hours? (assuming local.retention.ms is below that)
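For concreteness, here is a minimal sketch of applying these suggested values to one topic via the AdminClient - the topic name, bootstrap server, and exact values are placeholders, and it assumes local.retention.ms is already set below segment.ms as noted above:

```java
// Minimal sketch (placeholder topic/bootstrap values): lower segment.bytes to
// ~100 MiB and segment.ms to 12 hours on one tiered topic, as suggested above.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class LowerRollConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic");
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("segment.bytes", "104857600"), // ~100 MiB
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("segment.ms", "43200000"),     // 12 hours
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```

The same change can of course be made with kafka-configs.sh, or cluster-wide via the broker-level log.segment.bytes / log.roll.ms defaults.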
Do you see any drawbacks to these suggestions? It would be helpful to the community to talk about this and establish a convention.
@jlprat @giuseppelillo thanks for merging my docs PR. Is it possible to get some attention on this? Just looking to have a discussion, not necessarily an outcome.
@stanislavkozlovski thanks for the questions! I totally agree these settings should be reconsidered when enabling remote storage for a topic.
The suggested values also sound reasonable. To avoid rolling segments too often, we could think of a minimum segment ms to honor in case throughput pushes rotation too frequently (similar to fetch.min.bytes/fetch.max.bytes, but for time).
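Purely as a sketch of that idea (nothing like this exists in Kafka today, and the names are made up for illustration), such a guard could look roughly like:

```java
// Hypothetical guard, purely to illustrate the idea above; a "minimum segment ms"
// is not an existing Kafka config. A size-triggered roll is honored only once the
// active segment is at least minSegmentMs old, so a small segment.bytes on a
// high-throughput partition doesn't roll a new segment every few seconds; a
// time-triggered roll (segment.ms) is honored regardless.
public class SegmentRollGuard {
    static boolean shouldRollSegment(long activeSegmentBytes, long segmentBytes,
                                     long activeSegmentAgeMs, long segmentMs,
                                     long minSegmentMs) {
        boolean sizeExceeded = activeSegmentBytes >= segmentBytes;
        boolean timeExceeded = activeSegmentAgeMs >= segmentMs;
        boolean oldEnough = activeSegmentAgeMs >= minSegmentMs;
        return timeExceeded || (sizeExceeded && oldEnough);
    }
}
```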
Dynamic configuration would be cool in many places, indeed. e.g. batching, etc.
A relevant implication for this plugin, though, is that with the current API, pre-fetching is only possible within the same segment. Reducing the segment size will make this limitation more visible, since a sequential consumer will cross segment boundaries - where prefetching can't help - much more often.
Thinking about this, we proposed https://cwiki.apache.org/confluence/display/KAFKA/KIP-1003%3A+Signal+next+segment+when+remote+fetching, but it has fallen silent for a while (maybe it was proposed too early :)). If there is some positive feedback around this recommendation, maybe that would bump the discussion.
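To illustrate the kind of signal that KIP is about (the interface below is hypothetical - it is not part of Kafka or of this plugin): if the fetch path could tell the plugin which remote segment a sequential read will hit next, the plugin could start warming that segment instead of prefetching only within the one currently being read.

```java
// Hypothetical illustration only - this interface does not exist in Kafka or in
// the plugin. It sketches the kind of "next segment" signal KIP-1003 talks about,
// which would allow a plugin to prefetch across segment boundaries.
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;

interface NextSegmentAware {
    // Called before a sequential consumer reaches the end of the segment it is
    // currently reading, so the plugin can start fetching the next one early.
    void onNextSegmentHint(RemoteLogSegmentMetadata nextSegment);
}
```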