63416: sql: emit point deletes during delete fastpath r=yuzefovich a=jordanlewis

Previously, the "deleteRange" SQL operator, which is meant to be a fast path for cases in which an entire range of keys can be deleted, always did what its name says: it emitted DeleteRange KV operations. This precludes a crucial optimization: sending point deletes when the list of deleted keys is exactly known.

For example, a query like `DELETE FROM kv WHERE k = 10000` uses the "fast path" delete, since it has a contiguous set of keys to delete and doesn't need to know the values that were deleted. But in this case the performance is actually worse with a DeleteRange KV operation (see #53939), because:

- Ranged KV writes (DeleteRangeRequest) cannot be pipelined, because the intents they will leave cannot be enumerated ahead of time. They must therefore perform evaluation and replication synchronously.
- Ranged KV writes (DeleteRangeRequest) result in ranged intent resolution, which is less efficient (although this became less important once time-bound iterators were re-enabled).

The reason we couldn't previously emit point deletes in this case is that SQL needs to know whether it deleted something or not. This means we can't do a "blind put" of a deletion: we need to know whether there was something that we were "overwriting" with our delete.

This commit modifies the DeleteResponse to always return a boolean indicating whether a key from the DeleteRequest was actually deleted. Additionally, the deleteRange SQL operator now detects when it can emit single-key deletes, and does so.

Closes #53939.

Release note (performance improvement): point deletes in SQL are now more efficient during concurrent workloads.

76233: kv: remove clock update on BatchResponse r=nvanbenschoten a=nvanbenschoten

Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp.
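The fast-path decision described above can be sketched as follows. This is a minimal illustration, not CockroachDB's actual API: the `request` types and `buildDeleteRequests` helper are hypothetical stand-ins for the real `DeleteRequest`/`DeleteRangeRequest` plumbing.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a KV request.
type request interface{ fmt.Stringer }

// deleteReq models a pipelinable point delete (DeleteRequest).
type deleteReq struct{ key string }

// deleteRangeReq models a ranged delete (DeleteRangeRequest), which cannot
// be pipelined because its intents cannot be enumerated ahead of time.
type deleteRangeReq struct{ start, end string }

func (d deleteReq) String() string      { return "Delete(" + d.key + ")" }
func (d deleteRangeReq) String() string { return "DeleteRange[" + d.start + "," + d.end + ")" }

// buildDeleteRequests sketches the decision: when the exact set of keys to
// delete is known, emit point Deletes (each of whose responses can report
// whether a key was actually deleted); otherwise fall back to a single
// DeleteRange over the span.
func buildDeleteRequests(keys []string, start, end string) []request {
	if keys != nil {
		reqs := make([]request, 0, len(keys))
		for _, k := range keys {
			reqs = append(reqs, deleteReq{key: k})
		}
		return reqs
	}
	return []request{deleteRangeReq{start: start, end: end}}
}

func main() {
	// A single-key predicate like `WHERE k = 10000` yields one point delete.
	for _, r := range buildDeleteRequests([]string{"/kv/10000"}, "", "") {
		fmt.Println(r)
	}
}
```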
This was meant to handle cases where the batch request timestamp was forwarded during evaluation. It was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp that leads the local HLC clock on the leaseholder that evaluated the request. This has been true since #80706, which introduced the concept of a "local timestamp". That PR allowed us to remove the (broken) attempt at ensuring that the HLC on a leaseholder always leads the MVCC timestamp of all values in the leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go` in that PR).

The second is that it was not even correct. The idea behind bumping the HLC on the response path was to ensure that if a batch request was forwarded to a newer timestamp during evaluation and then completed a write, that forwarded timestamp would be reflected in the leaseholder's HLC. However, this ignored the fact that any forwarded timestamp must have come either from an existing value in the range or from the leaseholder's clock, so if those didn't lead the clock, the derived timestamp wouldn't either. It also ignored the fact that the clock bump happened too late (post-latch release): even if it had been needed (it wasn't), it wouldn't have ensured that the timestamp on any lease transfer led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns with this change, because #80706 ensured that any future-time operation will continue to use the synthetic bit until all nodes are running v22.2 or later.

85350: insights: ingester r=matthewtodd a=matthewtodd

Closes #81021.

Here we begin observing statements and transactions asynchronously, to avoid slowing down the hot SQL execution path as much as possible.
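The asynchronous-observation pattern behind the ingester can be sketched as below. This is an illustrative sketch, not the insights package's actual implementation: the hot path performs a non-blocking send into a buffered channel and sheds load when the buffer is full, while a background goroutine drains the buffer.

```go
package main

import "fmt"

// event is a hypothetical stand-in for an observed statement or transaction.
type event struct{ id int }

// ingester buffers observations so the hot execution path never blocks.
type ingester struct {
	events  chan event
	dropped int // observations shed because the buffer was full
}

func newIngester(buf int) *ingester {
	return &ingester{events: make(chan event, buf)}
}

// Observe is called on the hot path; it must never block, so it drops the
// observation rather than waiting when the buffer is full.
func (i *ingester) Observe(e event) {
	select {
	case i.events <- e:
	default:
		i.dropped++
	}
}

// Run drains events off the hot path, typically in its own goroutine.
func (i *ingester) Run(process func(event)) {
	for e := range i.events {
		process(e)
	}
}

func main() {
	ing := newIngester(2)
	ing.Observe(event{1})
	ing.Observe(event{2})
	ing.Observe(event{3}) // buffer full: dropped, not blocked on
	fmt.Println(len(ing.events), ing.dropped)
}
```

The design trade-off is explicit: dropping observations under load is acceptable for an observability feature, while stalling query execution is not.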
Release note: None

85440: colmem: improve memory-limiting behavior of the accounting helpers r=yuzefovich a=yuzefovich

**colmem: introduce a helper method when no memory limit should be applied**

This commit is a purely mechanical change.

Release note: None

**colmem: move some logic of capacity-limiting into the accounting helper**

This commit moves the logic that was duplicated across each user of the SetAccountingHelper into the helper itself. This de-duplicates some code and makes the refactoring in the following commit easier. Additionally, this commit makes a tiny change to make the resetting behavior in the hash aggregator more precise.

Release note: None

**colmem: improve memory-limiting behavior of the accounting helpers**

This commit fixes an oversight in how we allocate batches of "dynamic" capacity. We have two related ways of reallocating batches, and both work by growing the capacity of the batch until the memory limit is exceeded; the batch is then reused until the end of the query execution. This is a reasonable heuristic under the assumption that all tuples in the data stream are roughly equal in size, but that might not be the case. In particular, consider an example where 10k small rows of 1KiB are followed by 10k large rows of 1MiB. Under the old heuristic, we happily grow the batch to a capacity of 1024 and then never shrink it, so once the large rows start appearing, we put 1GiB worth of data into a single batch, significantly exceeding our memory limit (usually 64MiB with the default `workmem` setting).
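The failure mode of the old heuristic can be reproduced with a small model. This is a sketch under stated assumptions, not colmem's real code: `oldGrowth` is a hypothetical function modeling capacity doubling until the footprint crosses the memory limit, after which the capacity is never shrunk.

```go
package main

import "fmt"

const (
	memLimit = 64 << 20 // 64MiB, the default workmem limit
	maxCap   = 1024     // modeling coldata.BatchSize()
)

// oldGrowth models the old heuristic: capacity doubles on each reallocation
// while the batch footprint stays under the memory limit, and the resulting
// capacity is then kept for the rest of the query.
func oldGrowth(rowSize, capacity int) int {
	for capacity < maxCap && capacity*rowSize < memLimit {
		capacity *= 2
	}
	return capacity
}

func main() {
	// 1KiB rows: the footprint never nears 64MiB, so capacity grows to 1024.
	c := oldGrowth(1<<10, 1)
	fmt.Println("capacity after small rows:", c)
	// Once 1MiB rows arrive, that same 1024-capacity batch holds:
	fmt.Println("footprint in GiB:", c*(1<<20)/(1<<30))
}
```

This is exactly the 1GiB-in-one-batch scenario from the example above, roughly 16x over the 64MiB limit.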
This commit introduces a new heuristic:

- The first time a batch exceeds the memory limit, its capacity is memorized; from then on, that capacity is the upper bound on the capacities of batches allocated through the helper.
- If at any point a batch exceeds the memory limit by at least a factor of two, that batch is discarded, and the capacity will never again exceed half of the discarded batch's capacity.
- If the memory limit is not reached, the dynamic capacity growth provided by `Allocator.ResetMaybeReallocate` still applies (i.e. capacities grow exponentially until `coldata.BatchSize()`).

Note that this heuristic cannot grow the maximum capacity once it has been set, although it might make sense to do so (say, if after shrinking the capacity, the next five times we see that the batch is using less than half of the memory limit). This is a conscious omission: since I want this change to be backported, never growing seems like the safer choice, so this improvement is left as a TODO.

Also, we might still create batches with too large a memory footprint in places that don't use the SetAccountingHelper (e.g. in the columnarizer), since we perform the memory limit check at batch granularity. However, this commit improves things there as well: we no longer reuse such a batch on the next iteration and instead use half of its capacity.

Fixes: #76464.

Release note (bug fix): CockroachDB now more precisely respects the `distsql_workmem` setting, which improves the stability of each node and makes OOMs less likely.

**colmem: unexport Allocator.ResetMaybeReallocate**

This commit is a mechanical change to unexport `Allocator.ResetMaybeReallocate` so that users are forced to use the method with the same name on the helpers. This required splitting the tests into two files.
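The three rules of the new heuristic can be sketched as a small state machine. This is an illustrative model, not colmem's actual implementation: `capacityLimiter` and `nextCapacity` are hypothetical names, and the real logic lives inside the SetAccountingHelper.

```go
package main

import "fmt"

// capacityLimiter models the new heuristic: exponential growth under the
// limit, a memorized upper bound once the limit is crossed, and halving
// after a 2x overshoot.
type capacityLimiter struct {
	memLimit int
	maxCap   int // hard cap, modeling coldata.BatchSize()
	maxSeen  int // memorized upper bound; 0 means not yet set
}

// nextCapacity returns the capacity for the next batch, given the current
// batch's capacity and its observed memory footprint.
func (l *capacityLimiter) nextCapacity(curCap, footprint int) int {
	if footprint >= 2*l.memLimit {
		// Overshot the limit by at least 2x: discard the batch and never
		// exceed half of its capacity again.
		l.maxSeen = curCap / 2
		return l.maxSeen
	}
	if footprint > l.memLimit {
		// First time over the limit: memorize this capacity as the bound.
		if l.maxSeen == 0 || curCap < l.maxSeen {
			l.maxSeen = curCap
		}
		return l.maxSeen
	}
	// Under the limit: keep growing exponentially, subject to the bounds.
	next := curCap * 2
	if l.maxSeen > 0 && next > l.maxSeen {
		next = l.maxSeen
	}
	if next > l.maxCap {
		next = l.maxCap
	}
	return next
}

func main() {
	l := capacityLimiter{memLimit: 64 << 20, maxCap: 1024}
	// A 1024-capacity batch holding 1GiB is 16x over the 64MiB limit,
	// so the capacity is halved and stays bounded from then on.
	fmt.Println(l.nextCapacity(1024, 1<<30))
}
```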
Release note: None

85492: backupccl: remap all restored tables r=dt a=dt

This PR has a few changes, broken down into separate commits:

a) Stop restoring tmp tables and remove the special-case code to synthesize their special schemas. These were previously restored only to be dropped, so that restored jobs referencing them would not be broken, but we have since stopped restoring jobs.
b) Synthesize type-change jobs during cluster restore; this goes along with not restoring jobs.
c) Fix some assumptions in tests and other code about which IDs restored tables have.
d) Finally, always assign new IDs to all restored objects, even during cluster restore, removing the need to carefully move conflicting tables or other things around.

Commit-by-commit review recommended.

85930: jobs: make expiration use intended txn priority r=ajwerner a=rafiss

In aed014f these operations were supposed to be changed to use MinUserPriority. However, they weren't using the appropriate txn, so the change didn't have the intended effect.

Release note: None

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>