
kvserver: declare lock spans for AddSSTable #71676

Merged: 1 commit, Oct 26, 2021

Conversation

erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Oct 18, 2021

AddSSTable did not declare lock spans, even though it can return
WriteIntentError when encountering unresolved intents (e.g. when
checking for key collisions with DisallowShadowing set). This would
cause the concurrency manager to error with e.g.:

cannot handle WriteIntentError ‹conflicting intents on /Table/84/1/99/49714/0›
for request without lockTableGuard; were lock spans declared for this request?

This patch makes AddSSTable take out lock spans via
DefaultDeclareIsolatedKeys if DisallowShadowing is set. This will
automatically handle any unresolved intents via the concurrency manager.

Release note (bug fix): IMPORT INTO no longer crashes when encountering
unresolved write intents.
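
In sketch form, the change looks roughly like this (illustrative only: the real declare-keys hook lives in pkg/kv/kvserver/batcheval and its exact signature may differ):

```go
// Illustrative sketch, not the code from this PR. ImmutableRangeState,
// roachpb.Header, and spanset.SpanSet are types from the
// cockroachdb/cockroach tree; this approximates the batcheval
// declare-keys hook.
func declareKeysAddSSTable(
	rs ImmutableRangeState, header roachpb.Header, req roachpb.Request,
	latchSpans, lockSpans *spanset.SpanSet,
) {
	args := req.(*roachpb.AddSSTableRequest)
	if args.DisallowShadowing {
		// The collision check may encounter unresolved intents, so declare
		// isolated (lock) spans: the concurrency manager can then wait on
		// and resolve them instead of failing on WriteIntentError.
		DefaultDeclareIsolatedKeys(rs, header, req, latchSpans, lockSpans)
	} else {
		// Without the collision check, write latches alone suffice.
		DefaultDeclareKeys(rs, header, req, latchSpans, lockSpans)
	}
}
```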

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15
Contributor

Thanks!

One question about the release note: I think this crash is only possible in the case of an IMPORT INTO (i.e. only when an AddSSTableRequest is sent to an active keyspace, since otherwise it cannot possibly encounter intents). If this is true (someone else could confirm?), it might be worth being more specific in the release note.

@nvanbenschoten
Member

@dt I think part of why this was missed was because AddSSTable can only return a WriteIntentError if its DisallowShadowing flag is set. Could you confirm that even without DisallowShadowing, we would like AddSSTable to block on and attempt to resolve any overlapping intents that the lock table may know about?

If that is the case, then this change LGTM.

@erikgrinaker
Contributor Author

One question about the release note: I think this crash is only possible in the case of an IMPORT INTO (i.e. only when an AddSSTableRequest is sent to an active keyspace, since otherwise it cannot possibly encounter intents). If this is true (someone else could confirm?), it might be worth being more specific in the release note.

True, I've looked it over and it seems to mostly affect IMPORT INTO. There is the following bit which is used (I think) when backfilling indexes and building tables from queries, but I suppose in those cases we're always writing into unused keyspace. Updated the release notes.

// We disallow shadowing here to ensure that we report errors when builds
// of unique indexes fail when there are duplicate values.
DisallowShadowing: true,

@erikgrinaker erikgrinaker force-pushed the addsstable-isolated branch 2 times, most recently from 6e630ae to 6a69c8e on October 18, 2021 20:25
@dt
Member

dt commented Oct 18, 2021

I suppose when DisallowShadowing is false, we don't strictly need the current AddSSTableRequest to block on and exclude creation of new intents in its span? It is already allowed to write under other writes, and with DisallowShadowing false, its evaluation isn't conditional on being able to correctly read the current contents of that span either. I think there's still the concern about not letting it write a key that splits an intent from its value, but in backfills, where DisallowShadowing is false, we already take care to pick timestamps for the SST keys at which we know there are not, and never will be, intents (by scanning at that time first).

I wonder if IMPORT INTO should instead have been scanning the whole table, after it went offline, to create a timestamp at which it knows there are no intents and cannot be any later (thanks to tscache).

@erikgrinaker erikgrinaker force-pushed the addsstable-isolated branch 2 times, most recently from 4a3d22d to 335c1c6 on October 19, 2021 10:10
@erikgrinaker
Contributor Author

erikgrinaker commented Oct 19, 2021

I suppose when DisallowShadowing is false, we don't strictly need the current AddSSTableRequest to block on and exclude creation of new intents in its span? It is already allowed to write under other writes, and with DisallowShadowing false, its evaluation isn't conditional on being able to correctly read the current contents of that span either. I think there's still the concern about not letting it write a key that splits an intent from its value, but in backfills, where DisallowShadowing is false, we already take care to pick timestamps for the SST keys at which we know there are not, and never will be, intents (by scanning at that time first).

Ok, I've updated the PR to only declare lock spans when DisallowShadowing is set. We should backport this, so it makes sense to restrict the change as much as possible.

I wonder if IMPORT INTO should instead have been scanning the whole table, after it went offline, to create a timestamp at which it knows there are no intents and cannot be any later (thanks to tscache).

That might make sense. The current approach also has the downside of returning a WriteIntentError for each encountered intent and retrying the AddSSTable request after resolving each one, so it'll be quadratic in the number of intents (see recent backup woes that had the same problem) -- opened #71697 for this.
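
To illustrate the quadratic pattern (all names below are invented stand-ins, not CockroachDB APIs): each evaluation reports only the first conflicting intent, so resolving n intents takes n full re-evaluations.

```go
// Invented sketch of the resolve-and-retry loop described above.
type writeIntentError struct{ intentKey string }

func (e *writeIntentError) Error() string {
	return "conflicting intent on " + e.intentKey
}

// Stand-in stubs; real evaluation and intent resolution are more involved.
func evaluateAddSSTable(sst []byte) error { return nil }
func resolveIntent(key string) error      { return nil }

func addSSTableResolvingIntents(sst []byte) error {
	for {
		err := evaluateAddSSTable(sst) // re-checks all keys from the start
		wiErr, ok := err.(*writeIntentError)
		if !ok {
			return err // success (nil) or an unrelated error
		}
		// Resolve the single reported intent and retry the whole request:
		// n intents mean n+1 evaluations, i.e. O(n²) keys scanned in total.
		if rErr := resolveIntent(wiErr.intentKey); rErr != nil {
			return rErr
		}
	}
}
```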

Let's take this into account for the MVCCification work on AddSSTable; we'll need to figure out a consistent approach to concurrency control and intents.

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/batcheval/cmd_add_sstable.go, line 44 at r1 (raw file):

	args := req.(*roachpb.AddSSTableRequest)
	// DisallowShadowing may encounter intents while checking for key collisions,
	// particularly in the case of IMPORT INTO.

This comment should speak more towards the intended behavior of AddSSTable with and without DisallowShadowing. Why is one able to ignore conflicting writes/intents while the other is not? And how do we want an AddSSTable that has DisallowShadowing set to true to interact with conflicting intents?

Also, the effect of this is that we will now begin waiting in the lock table for any intent that overlaps the AddSSTableRequest's key span, not just those that collide with the individual keys in the sstable. Is this ok @dt? Could this lead to starvation where there are always overlapping intents at higher and higher timestamps?

This is leading me to question whether this is the right fix or whether the current interaction between AddSSTableRequest and conflicting intents makes sense. If we addressed #71697, could we then treat intents the same way that we treat other values in checkForKeyCollisionsGo, throwing a hard error if they exist at lower timestamps and ignoring them if they exist at higher timestamps?

@sumeerbhola
Collaborator

I think there's still the concern about not letting it write a key that splits an intent from its value

It's very unclear to me what invariants are being guaranteed by the layer above KV, and which subset of those we want to check even if there is a guarantee, and which are too expensive to check (I am hoping the latter set is the empty set, since checking the separated lock table is cheap). I think I am basically echoing what @nvanbenschoten said in his previous comment with "This comment should speak more ...".

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @erikgrinaker)


pkg/kv/kvserver/batcheval/cmd_add_sstable.go, line 44 at r1 (raw file):
Will look into the current AddSSTable semantics and guarantees, and write up a comment summarizing them.

This is leading me to question whether this is the right fix or whether the current interaction between AddSStableRequest and conflicting intents makes sense. If we addressed #71697, could we then treat intents the same way that we treat other values in checkForKeyCollisionsGo, throwing a hard error if they exist at lower timestamps and ignoring them if they exist at higher timestamps?

That sounds reasonable to me, given the current behavior. We'd still have to resolve the intents at a higher level though, to get the updated write timestamp, right? Which would involve returning a WriteIntentError in any case?

For MVCC-compliant AddSSTable we'd have to treat the intents as any other 1PC write would. But we'd still support a non-MVCC AddSSTable for e.g. streaming replication.

@erikgrinaker
Contributor Author

erikgrinaker commented Oct 21, 2021

Ok, this is a pretty hefty brain-dump, mostly for my own sake. I'm going to condense it down to a comment in this PR, unless anyone spots any glaring flaws. Will probably flesh this out as an RFC or tech note for MVCC-compliant AddSSTable too.

Tl;dr I think the current solution (declaring lock spans when DisallowShadowing == true) is the right fix, and we should address #71697. This provides a path to making AddSSTable fully MVCC-compliant, with a few tweaks to conflict handling (we may want more fine-grained locks though).

@nvanbenschoten's suggestion above doesn't seem right, because we want to error on any conflicting keys regardless of whether they are past or future, and we need to resolve the intent to see if it was committed or aborted. When DisallowShadowing == true, callers make sure the table is either offline or the table descriptor does not exist yet, so we should not conflict with any ongoing/future transactions.


General AddSSTable Behavior

  • Adds an SSTable to Pebble. This is equivalent to committing a Pebble batch
    containing only Set operations: it is atomic, and only affects the given keys.

  • Notable request flags:

    • isWrite: mutates data, so goes through Raft.
    • isRange: operates over a key range, where Key and EndKey are set to cover the first and last key in the SSTable.
    • isAlone: never appears in a batch with other requests.
    • isUnsplittable: cannot be split across multiple ranges by DistSender.
    • Not isTxn: cannot be in a transaction.
    • Not isLocking: does not take out transactional write locks (implied by !isTxn).
    • Not isIntentWrite: does not write intents (implied by !isTxn and !isLocking).
    • Not updatesTSCache: does not update the timestamp cache.
  • Does not respect MVCC invariants: can modify history, and can write at any
    timestamp.

  • DisallowShadowing: if true, does not allow writing keys that already exist
    (including keys at future timestamps). Ignores a deletion tombstone below the
    write timestamp if it is the most recent version of the key. Allows idempotent
    re-writes (same key/value at the same timestamp); see the sketch after this
    list. Still violates MVCC, because it can write at timestamps other than the
    request timestamp, including below the closed timestamp.

  • If no MVCC stats are provided by caller, scans the range to compute MVCC stats.
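
In toy form, the DisallowShadowing collision rule above, applied to the most recent existing version of an SST key (all names invented for illustration):

```go
import "bytes"

// version is a toy stand-in for the most recent existing MVCC version.
type version struct {
	ts          int64 // toy timestamp
	value       []byte
	isTombstone bool
}

// collides applies the DisallowShadowing rule sketched above.
func collides(existing version, sstTS int64, sstValue []byte) bool {
	if existing.isTombstone && existing.ts < sstTS {
		return false // tombstone below the SST key's timestamp is ignored
	}
	if existing.ts == sstTS && bytes.Equal(existing.value, sstValue) {
		return false // idempotent: same key/value at the same timestamp
	}
	return true // collision, past or future timestamp alike
}
```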

Current Implementation

  • Takes out write latches via DefaultDeclareKeys over the entire key span
    given by [Header.Key,Header.EndKey). This provides isolation from all
    concurrent requests within the declared key span.

  • Does not declare any lock spans. This does not provide isolation from
    concurrent transactions, and does not allow intent handling (specifically,
    returning a WriteIntentError when encountering an unresolved intent is
    invalid).

  • Does not write at the batch's write timestamp, and thus does not respect
    the closed timestamp: even if the Raft proposal buffer pushes the write
    timestamp into the future to comply with the closed timestamp, the request will
    still write at its original SST-specified timestamps.

Proposed Short-Term Fix

With DisallowShadowing: true, the request must error if any keys in the SST
already exist at any timestamp (including future keys, but ignoring tombstones
below the SST key timestamp). This means that if we encounter an intent, we must
return a WriteIntentError to attempt to resolve the intent and retry: if the
intent was aborted, the request may succeed, but if the intent was committed the
request must error.

Using DefaultDeclareIsolatedKeys when DisallowShadowing == true will take
out write locks across the entire [Key,EndKey) span: the request enters the
wait queue if conflicting locks are known to be held by concurrent txns, and
any WriteIntentError returned during evaluation can now be waited on. In both
cases the concurrency manager will attempt to push the intent/lock owner,
waiting for it to commit or abort. #71697 should be addressed to avoid
quadratic behavior when encountering intents.

The lock queue is FIFO, so this may cause head-of-line blocking for other
transactions that want to acquire locks in this span. We can make the locks more
fine-grained by iterating over the SST and only taking out locks for the
specific keys we are writing, instead of the entire span. This will make
evaluation slower, but may reduce contention for spans that are seeing frequent
concurrent writes. This does not appear to be beneficial currently.
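
A toy sketch of that fine-grained variant (iterator and span-set types are invented; the real code would use an SST iterator and spanset.SpanSet):

```go
// Invented minimal types for illustration.
type sstIterator interface {
	First()
	Valid() bool
	Next()
	Key() []byte
}

type lockSpanSet struct{ points [][]byte }

func (s *lockSpanSet) addPoint(k []byte) { s.points = append(s.points, k) }

// declarePointLockSpans declares one point lock span per SST key instead of
// a single ranged span over [Key, EndKey).
func declarePointLockSpans(it sstIterator, locks *lockSpanSet) {
	for it.First(); it.Valid(); it.Next() {
		// Less head-of-line blocking for concurrent writers, at the cost of
		// scanning the entire SST during declaration.
		locks.addPoint(it.Key())
	}
}
```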

This will make AddSSTable behave more correctly wrt. concurrent writers.
However, since all AddSSTable requests are free to write at any timestamp,
and do not update the SST key timestamps after the request has been submitted
by the client, they still will not respect the closed timestamp.

Requests with DisallowShadowing: false will happily mutate MVCC history, so
there is no need to respect locks or resolve intents in this case (and thus to
declare lock spans): the value is going to be written regardless of whether an
intent commits or aborts. This is clearly unsafe from an isolation standpoint,
and AddSSTable puts the onus on the caller to make sure the span is not
accessed by past or concurrent transactions. The request still holds a write
latch across the [Key,EndKey) span, isolating it from concurrent requests.

MVCC-compliant AddSSTable

An MVCC-compliant AddSSTable should essentially behave like a 1PC Put batch:

  • Writes a set of keys to a single range as part of a single Raft command.
  • Cannot write at or below existing keys' timestamps, and must resolve intents.
  • Keys are written with the batch's timestamp, even when modified.
  • Does not write intents, as it cannot be part of a transaction.

An MVCC-compliant AddSSTable could therefore evaluate with DisallowShadowing: true,
if the checks were extended to return appropriate errors like Put,
e.g. when encountering future values. We could also add a new parameter
that allowed writing above existing values but not below them. Keys must be
written at the batch timestamp, even if the batch gets pushed or retried, to
respect the closed timestamp.
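
In toy form, the Put-like conflict check this implies (invented names; a real MVCC write returns a structured WriteTooOldError rather than a plain error):

```go
import "fmt"

// checkPutLike sketches the conflict rule for an MVCC-compliant AddSSTable:
// keys are written at the batch timestamp, and an existing version at or
// above it must fail the way a regular Put would.
func checkPutLike(existingTS, batchTS int64) error {
	if existingTS >= batchTS {
		// Analogous to WriteTooOldError: the caller can retry the batch at
		// a timestamp above the existing version.
		return fmt.Errorf("write too old: existing %d >= batch %d", existingTS, batchTS)
	}
	return nil // writing above the existing version is allowed
}
```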

Current AddSSTable Callers

Callers which set DisallowShadowing: false must take particular care to avoid
interacting with spans that have been accessed by other transactions (past or
ongoing). Also, for future MVCC compliance, callers must be able to write at
the current batch timestamp.

Below is a list of current AddSSTable callers, noting DisallowShadowing and
the SST key timestamps for comparison, as well as their isolation mechanism:

| Operation | DisallowShadowing | Timestamp | Isolation |
| --- | --- | --- | --- |
| Imports | true [2,3] | Now | Offline table |
| CREATE TABLE AS SELECT | true | Read TS? | Table descriptor |
| Materialized views | true | Read TS | Table descriptor |
| Index backfills | false | Now | Index descriptor |
| Restore (backup) | true | Key TS | Table descriptor? |
| Streaming replication | false | Key TS | Offline tenant |

From this, we see that only imports are vulnerable to existing intents, since
all other callers with DisallowShadowing: true rely on SQL descriptors for
isolation, and no other SQL transaction can have interacted with a span owned by
a descriptor that has not yet been created.

Similarly, we see that declaring lock spans for DisallowShadowing: true should
not conflict with active/future transactions, since the entities are either
offline or the descriptors do not exist yet. Notably, index backfills allow
shadowing, so they will not declare lock spans and thus will not prevent
concurrent index writes.

Furthermore, with the new incremental index backfill approach under development,
only backup restoration and streaming replication will have a potential need to
use non-MVCC AddSSTable. All other operations should be able to use
DisallowShadowing: true and write at the batch timestamp, while declaring lock
spans without interfering with concurrent transactions.

@erikgrinaker
Contributor Author

Updated the PR with a comment summarizing AddSSTable concurrency considerations, PTAL.

@erikgrinaker erikgrinaker force-pushed the addsstable-isolated branch 4 times, most recently from c0c3006 to 81d755d on October 22, 2021 07:51
@nvanbenschoten
Member

Thanks for the detailed analysis Erik! Everything you said above sounds correct, including your proposed next steps.

@nvanbenschoten's suggestion above doesn't seem right, because we want to error on any conflicting keys regardless of whether they are past or future, and we need to resolve the intent to see if it was committed or aborted.

Agreed. I had read the logic in checkForKeyCollisionsGo to mean that we were ignoring all future-time committed values. In reality, we were ignoring all past-time committed deletion tombstone values. I should have read it more carefully instead of relying on memory.

We can make the locks more fine-grained by iterating over the SST and only taking out locks for the specific keys we are writing, instead of the entire span. This will make evaluation slower, but may reduce contention for spans that are seeing frequent concurrent writes. This does not appear to be beneficial currently.

This does seem like the right approach if we want to avoid unnecessary head-of-line blocking for conflicting concurrent transactions. In theory, it may also be important to avoid starvation whereby the AddSSTable can never make it out of the concurrency manager because foreground traffic continues to establish new lock wait-queues on different keys. By design, the concurrency manager's handling of large ranged writes that touch uncountably many keys is not fair in this regard, as ranged writes do not prevent the establishment of new lock wait-queues on conflicting keys while waiting in existing lock wait-queues. Doing so would be a fairness-throughput tradeoff. However, your analysis in Current AddSSTable Callers indicates that this is likely not a problem in practice.

The request still holds a write latch across the [Key,EndKey) span, isolating it from concurrent requests.

This probably isn't a conversation for this PR, but @sumeerbhola and I were talking yesterday and were surprised that this doesn't lead to user-perceived latency during index builds. If an AddSSTable request is acquiring a wide write latch, it's going to block any other writer that overlaps with its span (even those with no logical overlap) for its entire Raft pass, which could be quite expensive.

Do we see this show up in the foreground performance of writes to a table during an index backfill? Is this a concern?

EDIT: he just reminded me that we plan to do something about this in the future by performing the bulk of the backfill of new indexes while it is fully offline.

An MVCC-compliant AddSSTable could therefore evaluate with DisallowShadowing: true, if the checks were extended to return appropriate errors like Put, e.g. when encountering future values.

By this, do you mean that instead of throwing an error, checkForKeyCollisions would return the largest timestamp of any conflicting key-value and then the AddSSTable would shift its batch timestamp to write above this? From the KV perspective, this would be fine. However, I seem to recall that SQL also uses this error to detect uniqueness constraint violations, and so it wants to hear about these conflicts even if KV can avoid them.

Member

@nvanbenschoten nvanbenschoten left a comment


The code comment :lgtm:.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker
Contributor Author

We can make the locks more fine-grained by iterating over the SST and only taking out locks for the specific keys we are writing, instead of the entire span. This will make evaluation slower, but may reduce contention for spans that are seeing frequent concurrent writes. This does not appear to be beneficial currently.

This does seem like the right approach if we want to avoid unnecessary head-of-line blocking for conflicting concurrent transactions. In theory, it may also be important to avoid starvation whereby the AddSSTable can never make it out of the concurrency manager because foreground traffic continues to establish new lock wait-queues on different keys. By design, the concurrency manager's handling of large ranged writes that touch uncountably many keys is not fair in this regard, as ranged writes do not prevent the establishment of new lock wait-queues on conflicting keys while waiting in existing lock wait-queues. Doing so would be a fairness-throughput tradeoff. However, your analysis in Current AddSSTable Callers indicates that this is likely not a problem in practice.

I see, I wasn't aware that range latches were treated differently here -- I thought these wait-queues were always FIFO. That's worth keeping in mind.

I still think a range latch is the right call here, considering these requests are mostly used in empty key spans. These SSTs can contain in excess of 1 million keys, and I would expect taking out that many point latches to have a significant cost. If starvation is shown to be problematic, we can expose a request parameter to control this behavior.

The request still holds a write latch across the [Key,EndKey) span, isolating it from concurrent requests.

This probably isn't a conversation for this PR, but @sumeerbhola and I were talking yesterday and were surprised that this doesn't lead to user-perceived latency during index builds. If an AddSSTable request is acquiring a wide write latch, it's going to block any other writer that overlaps with its span (even those with no logical overlap) for its entire Raft pass, which could be quite expensive.

Do we see this show up in the foreground performance of writes to a table during an index backfill? Is this a concern?

EDIT: he just reminded me that we plan to do something about this in the future by performing the bulk of the backfill of new indexes while it is fully offline.

Yes, this work is already underway -- see the RFC.

Your point still seems valid for the current index backfills though. I'm not intimately familiar with how they are implemented, but perhaps @dt happens to know otherwise. But it doesn't appear too important since we're overhauling this anyway.

An MVCC-compliant AddSSTable could therefore evaluate with DisallowShadowing: true, if the checks were extended to return appropriate errors like Put, e.g. when encountering future values.

By this, do you mean that instead of throwing an error, checkForKeyCollisions would return the largest timestamp of any conflicting key-value and then the AddSSTable would shift its batch timestamp to write above this? From the KV perspective, this would be fine. However, I seem to recall that SQL also uses this error to detect uniqueness constraint violations, and so it wants to hear about these conflicts even if KV can avoid them.

This wasn't thought all the way through. We should introduce a parameter that makes AddSSTable MVCC-compliant such that it mostly behaves like any other Put, e.g. it can write above (but not below) an existing version and returns WriteTooOldError where appropriate. We probably still want to keep DisallowShadowing for cases where we do not want to replace existing values at all.

Is there anything else outstanding for this PR then @nvanbenschoten, or is it good to merge?

@nvanbenschoten
Member

Is there anything else outstanding for this PR then @nvanbenschoten, or is it good to merge?

This looks good to merge.

@nvanbenschoten
Member

Although, I do think we should ask the question of why we never hit this crash in a test. That indicates that we never exercise the code path of an AddSSTable request hitting an intent at any level above the low-level cmd_add_sstable_test.go unit tests. We should fix that, maybe even ahead of addressing #71697.

@erikgrinaker
Contributor Author

Although, I do think we should ask the question of why we never hit this crash in a test. That indicates that we never exercise the code path of an AddSSTable request hitting an intent at any level above the low-level cmd_add_sstable_test.go unit tests. We should fix that, maybe even ahead of addressing #71697.

Yep, added a test for it now.

I'm about to start work on an MVCC-compliant AddSSTable, I'll make sure to add comprehensive test coverage then.

`AddSSTable` did not declare lock spans, even though it can return
`WriteIntentError` when encountering unresolved intents (e.g. when
checking for key collisions with `DisallowShadowing` set). This would
cause the concurrency manager to error with e.g.:

```
cannot handle WriteIntentError ‹conflicting intents on /Table/84/1/99/49714/0›
for request without lockTableGuard; were lock spans declared for this request?
```

This patch makes `AddSSTable` take out lock spans via
`DefaultDeclareIsolatedKeys` if `DisallowShadowing` is set. This will
automatically handle any unresolved intents via the concurrency manager.

Release note (bug fix): `IMPORT INTO` no longer crashes when encountering
unresolved write intents.
@erikgrinaker
Contributor Author

bors r=nvanbenschoten,dt,aayushshah15

@craig
Contributor

craig bot commented Oct 26, 2021

Build succeeded:
