-
Notifications
You must be signed in to change notification settings - Fork 15
sweep leaves sentinels and tombstones forever #2398
Comments
@fsamuel-bs @gbonik do you guys have insights on the last paragraph in the issue? (re |
Seems like we don't actually guarantee correctness for THOROUGH (SnapshotTransaction.java):
We could probably mitigate it by checking the immutableTs lock even when we have no writes, so we throw at the end of the transaction block. This, of course, still doesn't save us from the application code that modifies some non-local mutable data structure within a transaction (we'd need to check the lock after every read I guess). |
cool so some options here might be:
I'm not as concerned about users modifying data in the middle of a write transaction - that is bad practice and not a whole lot we can do to prevent it. |
I agree we can't do much to prevent it (other that checking the lock after every single read), but a user might not realize the problem exists. They could start with a read-only transaction, then realize that THOROUGH forces them to use a "regular" transaction, so they switch from runTaskReadOnly() to runTaskThrowOnConflict(), but still treat it as a read-only transaction. It's just counter-intuitive that we throw on commit rather than on read. You get some data back, but only later it turns out it wasn't valid. |
Maybe @carrino has some thoughts here as well. Maybe it actually isn't the worst thing to make every transaction lock an immutable timestamp.. the overhead could reduced to one call to check it's still valid, as we could share the lock across multiple short transactions. This would allow sweep/scrub to clean up tombstones very efficiently. |
After thinking about this more, I think we need to have some grace period after which we GC sentinels and tombstones. Read transactions can check how much time has elapsed using I'd really like to use wall-clock time here (i.e. the |
An alternative here would be to have a lock, similar to the immutable timestamp lock, but likely distinct, which all read transactions acquire. We can then know for sure the age of the oldest read transaction. This allows us not only to sweep thoroughly (i.e., remove sentinels and tombstones), but also to sweep very quickly after a value has been written. For example, if there are were other active transactions, we could actually sweep immediately after writing. This has two drawbacks, which I think can be solved:
We can make this feature opt-in due to the above drawbacks, but for most users I think these aren't very impactful. |
I think one problem here is aggressive hard delete - it needs to wipe the values "right now" regardless of existing read transactions. Currently, for each wiped cell it writes a sentinel that will invalidate any read transaction that encounters it. If we don't write sentinels, we lose that mechanism. I guess we can still write sentinels when hard-deleting. |
Yeah we could continue to write the sentinels in cases where we don't want to wait for transactions to finish. The proposal here would allow us to sweep these sentinels async, once the transactions did finish. |
How would you tell if a sentinel can be removed safely? (A sentinel means transaction timestamp of -1) |
Hmm, I guess in the current conservative sweep world, you always have a sentinel + a latest committed value. In that case you can remove the sentinel when the oldest reader is past the timestamp of the latest committed value. But if hard-delete ONLY writes a sentinel, we'd need to ensure that all transactions are past the timestamp of the value that was deleted and necessitated the sentinel. And we don't know that timestamp. So we'd either need to write a sentinel + a tombstone (do we do this already?), or sweep would have to do some weird logic where it gets a fresh timestamp and waits for the oldest reader to progress past it. |
Currently, every table defaults to
SweepStrategy.CONSERVATIVE
. This strategy leaves behind at least two values for every cell swept:Note that leaving the latest value is essentially necessary if we leave sentinels, otherwise every query for the cell would return the sentinel and be forced to abort/retry.
This behavior is "correct" because it will never violate snapshot isolation guarantees (a read transaction that started at time t will always either get a fully consistent snapshot as of t, or read a sweep sentinel and throw an error). If we were to remove the sentinel, it's possible that some super old read transaction could try to read a cell, and mistakenly think that the value is empty, when in reality the value for its start timestamp has been swept.
I think we need to strike a balance here between infallible correctness and actually being able to clean up old tombstones. For example, a well-behaved app client that very rarely deletes cells but still uses range scans would slowly, over time, become infinitely slow due to the number of tombstones/sentinels it reads.
One idea would be to enforce a time limit on readOnly transactions (i.e., throw an error if you have been open for longer than the limit), and allow sentinels and tombstones to be deleted after that limit.
(On another note, I'm not sure if we currently guarantee correctness when using
SweepStrategy.THOROUGH
. Using this disallows readOnly transactions , but still allows read/write transactions which might not actually do any writes, which are effectively the same if their immutableTs lock is lost)The text was updated successfully, but these errors were encountered: