-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal to change the semantics of the last stable offset to always mean committed offset rather than "raft last visible index" #18244
Comments
cc @bharathv |
when you mean write caching, I assume you are using it as an umbrella term for acks=0/1 and actual write caching with acks=all?
TIL, that seems sketchy.. I don't fully understand that code but what happens when there is a truncation today during write caching? I think always using raft committed offset (like how stms do) seems less controversial to me. Even with rm_stm, it still returns high watermark equivalent if there is no acks=all data, that doesn't pose any problems?
I think that slows down consumers of acks=0/1 data. They have to wait for data to be raft committed until they consume, I remember changing it at some point and a bunch of acks=0/1 tests failed. |
@Lazin, please read this thread. tl;dr; I believe we need to define a "safe_to_upload" offset method in cloud storage which is max between LSO and committed offset. Nothing guarantees that |
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
the assumption was correct when it was introduced, because of that it might not be the only place where we make it |
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322 (cherry picked from commit ff358cc)
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322 (cherry picked from commit ff358cc)
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
The archival/tiered storage correctness assumption builds on the (wrong) assumption that LSO is monotonic. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior. However, we have discovered that property does not hold if there are no in-progress transaction and acks=0/1 or write caching is in use because LSO falls back to "last visible index"[^1] which can get truncated. Ref redpanda-data#18244 [^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
In the write caching design we made the assumption that last stable offset is always equal-or-greater than the committed offset because when transactions are in use, we always wait for the commands to be applied to the STMs, and commands below committed offset are never applied.
The archival/tiered storage correctness assumption builds on that. Because we upper-bound the tiered storage data to the last stable offset, it should never see suffix truncations caused by data loss. Tiered storage doesn't have a concept of suffix truncation so if that would happen it would lead violations of correctness properties and diverging logs/undefined behavior.
redpanda/src/v/cluster/partition.cc
Lines 1248 to 1254 in 2823c09
Now, after unrelated auditing of the code base, I have discovered that we didn't account for what happens when transactions are disabled, i.e.
rm_stm
is not present on a partition. Last Stable Offset is defined to be equivalent to the High Watermark.That might cause problems when write caching is in use. One possible fix is to bound the archival to the minimum between last stable offset and raft committed offset. All other potential users of LSO should also keep this in mind.
What if we instead redefine what LSO means when write caching is in use regardless of whether rm stm exists? We can make it always return the least between the traditional LSO (below undecided transactions) and raft committed offset.
This would keep the partition interface lean without the need of introducing another offset at this layer. It might come as a bit of a surprise that LSO doesn't match HWM when no transactions are in use. But it is already like this today when write caching is in use. Or when rm_stm lags behind applying commands (ref
redpanda/src/v/cluster/rm_stm.cc
Line 1293 in 88ac775
JIRA Link: CORE-2766
The text was updated successfully, but these errors were encountered: