Should we recommend lower segment.ms settings in the presence of this config? #617

Open
stanislavkozlovski opened this issue Oct 28, 2024 · 2 comments

stanislavkozlovski commented Oct 28, 2024

Problem

segment.ms and segment.bytes determine when we roll a segment and make it eligible for tiering. Their defaults are 7 days and 1 GiB, respectively.
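For reference, a quick way to check what a topic is actually running with is the Java Admin client (a minimal sketch; the broker address and topic name below are placeholders):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowSegmentSettings {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic"); // placeholder
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            // Print the effective value and where it comes from (topic override vs. broker default).
            for (String name : List.of("segment.ms", "segment.bytes", "local.retention.ms")) {
                System.out.printf("%s = %s (%s)%n", name, config.get(name).value(), config.get(name).source());
            }
        }
    }
}
```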

As far as I can tell, the reasoning for not having too small segment files is this:

  • open file descriptors. Each segment file is an open file descriptor, and Kafka has been known to run out of file descriptors. File descriptors are also theorized to take up memory/resources, but I don't think that's true
  • write speed. Having to roll a segment may cause a slight hiccup in latency, since it's more work than just writing to an already-open file

I don't think any of these reasons are particularly strong. I'm not sure I see a ton of drawbacks in having many segment files, but I may be missing some nuance.
Regardless, with tiered storage we don't keep many segment files locally - it's all in S3.

But waiting 7 days for data to end up in S3 seems way too long to me.
Waiting for 1 GiB per partition can also result in a lot of disk usage.

❌ In deployments with lots of partitions (e.g. 5k per broker), you can end up hogging 4.8 TiB of local storage (5,000 partitions × 1 GiB for the active segment alone) without a good reason.

For tiered storage to move the needle, it needs to make it affordable to deploy small and fast SSDs, and to reduce the amount of data you have to move when reassigning partitions (among other things).

Solution

  • configure segment.bytes to something lower - e.g. 50-100 MiB?
    • the only case where I imagine this could be disruptive is high-throughput partitions. If a partition is taking in 5 MiB/s, you wouldn't want it to roll a segment every 10 seconds. Having a dynamic segment.bytes would be a pretty cool feature
  • configure segment.ms to something lower - e.g. 12-24 hours? (assuming local.retention.ms is below that; see the sketch after this list)
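To make that concrete, here is a minimal sketch of applying lower values to a tiered topic with the Java Admin client; the broker address, topic name, and the exact numbers are illustrative placeholders, not tested recommendations:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class LowerSegmentSettings {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic"); // placeholder
            Collection<AlterConfigOp> ops = List.of(
                    // Roll at ~100 MiB instead of the 1 GiB default.
                    new AlterConfigOp(new ConfigEntry("segment.bytes", String.valueOf(100 * 1024 * 1024)),
                            AlterConfigOp.OpType.SET),
                    // Roll at least every 12 hours instead of every 7 days.
                    new AlterConfigOp(new ConfigEntry("segment.ms", String.valueOf(12L * 60 * 60 * 1000)),
                            AlterConfigOp.OpType.SET),
                    // Keep tiered data locally for ~1 hour only (example value; the assumption above
                    // is that local.retention.ms sits below segment.ms).
                    new AlterConfigOp(new ConfigEntry("local.retention.ms", String.valueOf(60L * 60 * 1000)),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```

The same overrides can of course also be set at topic creation time or through the command-line tools instead of the Admin client.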

Do you see any drawbacks in these suggestions? It would be helpful to the community to talk about this and establish a convention.

stanislavkozlovski commented

@jlprat @giuseppelillo thanks for merging my docs PR. Is it possible to get some attention on this? Just looking to have a discussion, not necessarily an outcome.

jeqo commented Nov 5, 2024

@stanislavkozlovski thanks for the questions! I totally agree these settings should be reconsidered when enabling remote storage for a topic.

The suggested values also sound reasonable. To avoid rolling segments too often, we could think of a minimum segment ms to respect in case throughput pushes rotation too frequently (similar to fetch.min/max.bytes, but for time) - a rough sketch is below.
Dynamic configuration would indeed be cool in many places, e.g. batching.
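As an illustration, the guard being described could look roughly like this (purely hypothetical - no such minimum-roll-interval setting exists in Kafka today):

```java
public class RollDecision {
    /**
     * Hypothetical roll check: keep the existing size/time triggers, but never
     * roll faster than a minimum interval. On a hot partition the active
     * segment may then temporarily grow past segment.bytes instead of rolling
     * every few seconds.
     */
    static boolean shouldRoll(long segmentSizeBytes, long segmentAgeMs,
                              long segmentBytes, long segmentMs, long minSegmentMs) {
        boolean sizeTrigger = segmentSizeBytes >= segmentBytes; // e.g. 100 MiB
        boolean timeTrigger = segmentAgeMs >= segmentMs;        // e.g. 12 hours
        return (sizeTrigger || timeTrigger) && segmentAgeMs >= minSegmentMs;
    }

    public static void main(String[] args) {
        long segmentBytes = 100L << 20;         // 100 MiB
        long segmentMs = 12L * 60 * 60 * 1000;  // 12 hours
        long minSegmentMs = 5L * 60 * 1000;     // hypothetical 5 minute floor
        // A 5 MiB/s partition fills 100 MiB in ~20 s, but the guard delays the roll:
        System.out.println(shouldRoll(150L << 20, 30_000, segmentBytes, segmentMs, minSegmentMs));         // false
        System.out.println(shouldRoll(150L << 20, 6L * 60 * 1000, segmentBytes, segmentMs, minSegmentMs)); // true
    }
}
```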

A relevant implication for this plugin, though, is that with the current API, pre-fetching is only possible within the same segment. Reducing the segment size will make this limitation more evident.
With this in mind we proposed https://cwiki.apache.org/confluence/display/KAFKA/KIP-1003%3A+Signal+next+segment+when+remote+fetching, but it has fallen silent for a while (maybe it was proposed too early :)). If there is some positive feedback around this recommendation, maybe that would bump the discussion.
