Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add time-bound iterator handling for range keys #82596

Closed
erikgrinaker opened this issue Jun 8, 2022 · 5 comments · Fixed by #82667
Closed

storage: add time-bound iterator handling for range keys #82596

erikgrinaker opened this issue Jun 8, 2022 · 5 comments · Fixed by #82667
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jun 8, 2022

We currently use table/block filters for time-bound iteration of point keys. We need this to handle range keys too. At the very least, it cannot have false negatives (skip tables/blocks with range keys that are within the range key interval). It would be great if we could rely on pebble.IterOptions.RangeKeyFilters for this, but all Pebble databases may not support it so it's possible this would require a costly data migration. In the worst case, just make sure we don't skip over range key blocks at all.

if opts.MaxTimestampHint.IsSet() {
encodedMinTS := string(encodeMVCCTimestamp(opts.MinTimestampHint))
encodedMaxTS := string(encodeMVCCTimestamp(opts.MaxTimestampHint))
p.options.TableFilter = func(userProps map[string]string) bool {
tableMinTS := userProps["crdb.ts.min"]
if len(tableMinTS) == 0 {
if opts.WithStats {
p.timeBoundNumSSTables++
}
return true
}
tableMaxTS := userProps["crdb.ts.max"]
if len(tableMaxTS) == 0 {
if opts.WithStats {
p.timeBoundNumSSTables++
}
return true
}
used := encodedMaxTS >= tableMinTS && encodedMinTS <= tableMaxTS
if used && opts.WithStats {
p.timeBoundNumSSTables++
}
return used
}
// We are given an inclusive [MinTimestampHint, MaxTimestampHint]. The
// MVCCWAllTimeIntervalCollector has collected the WallTimes and we need
// [min, max), i.e., exclusive on the upper bound.
p.options.PointKeyFilters = []pebble.BlockPropertyFilter{
sstable.NewBlockIntervalFilter(mvccWallTimeIntervalCollector,
uint64(opts.MinTimestampHint.WallTime),
uint64(opts.MaxTimestampHint.WallTime)+1),
}
}

Jira issue: CRDB-16526

Epic CRDB-2624

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-kv-replication labels Jun 8, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jun 8, 2022

cc @cockroachdb/replication

@blathers-crl blathers-crl bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jun 8, 2022
@erikgrinaker erikgrinaker self-assigned this Jun 8, 2022
@erikgrinaker
Copy link
Contributor Author

@jbowens Re: our previous discussion about this. Seems like we should move to block property collectors for TBIs. These appear to already support range keys.

Given that enabling range keys requires ratcheting the Pebble database version and CRDB cluster version, we can assume that once we start writing these block properties then readers will respect them. We can also stop writing table properties.

I don't think we need to do anything more here? Older SSTables that contain existing table props would be filtered like today based on point keys, and these cannot contain range keys so we're not incorrectly filtering anything here. Newer SSTables would not have table props, and thus would only be filtered on block properties, which are written accurately both for point and range keys.

Am I missing anything?

@nicktrav
Copy link
Collaborator

nicktrav commented Jun 9, 2022

I just created #82678 to work out if / how we can go about removing table properties entirely. We've been talking about this on Storage for a while.

@erikgrinaker
Copy link
Contributor Author

I just created #82678 to work out if / how we can go about removing table properties entirely. We've been talking about this on Storage for a while.

Way ahead of you: #82658

We'll just need to keep respecting them for as long as nodes can still have SSTs written by 21.2 or older versions. Not sure how best to ensure that though -- might need an actual migration in 23.1 or something.

@nicktrav
Copy link
Collaborator

nicktrav commented Jun 9, 2022

might need an actual migration in 23.1 or something.

Yeah, that issue is more for the migration. I'll update it to make it more pointed.

We have some precedent for doing these two step migrations now (see #77634 and cockroachdb/pebble#1537), so hopefully we can reuse some of that functionality.

craig bot pushed a commit that referenced this issue Jun 9, 2022
81071: sql: Allow DEFAULT/ON UPDATE types that can be assignment-cast r=e-mbrown a=e-mbrown

Resolves #74854

postgres:
```
postgres=# create table casts (b BOOL DEFAULT 'foo'::VARCHAR);
 ERROR:  column "b" is of type boolean but default expression is of type character varying>
HINT:  You will need to rewrite or cast the expression.
```
crdb:
  ```
[email protected]:26257/movr> CREATE TABLE t (b BOOL DEFAULT 'foo'::VARCHAR);
CREATE TABLE

Time: 8ms total (execution 8ms / network 0ms)
```

Release note (sql change): A column's DEFAULT/ON UPDATE clause
can now have a type that differs from the column type, as long as
that type can be assignment-cast into the column's type. This
increases our PostgreSQL compatibility.

81160: opt: simplify lookup join constraint building r=mgartner a=mgartner

#### opt: simplify lookup join constraint building

The lookup join constraint builder has been simplified so that it builds
constraints in a single pass over an index's columns. This is in
contrast to the previous method which would require a second pass over
the index's columns after it was discovered that a multi-span lookup
join with a lookup expression would be required. This change has a
number of consequences:

  1. Lookup joins can be planned in more cases. As an example, a lookup
     join can now be planned when there is an equality filter on a
     indexed computed column and a range filter on another indexed
     column.
  2. Constant value filters, like `x = 1`, are no longer included in
     lookup expressions. Instead, the constant value will be projected
     and the lookup expression will contain an equality and the
     projected column, like `x = projected_col`. This should have
     little, if any, effect on the performance of lookup joins.

This change required a minor change to the
`GenerateLocalityOptimizedAntiJoin` rule to ensure that projected
constant values are produced by the the inner anti-join. This is
necessary because the outer anti-join's lookup expression references the
projected column. This wasn't previously necessary because constant
values weren't projected and referenced in lookup expressions - they
were inlined directly in lookup expressions.

Release note: None

#### opt: remove "cross-join" from some tests

As of #79586, we no longer plan cross-joins as inputs of lookup joins.
This commit updates some test comments to reflect this.

Release note: None


81713: kvserver: benchmark empty replica addition/removal r=aayushshah15 a=aayushshah15

This commit adds a small benchmark to track low level regressions in the time it takes to add/remove an empty range's replica.

Relates to #72083

Release note: None

82317: ccl/sqlproxyccl: enable a more graceful shutdown r=pjtatlow a=pjtatlow

When a drainSignal is received, the sql proxy now waits for all connections to
close within a certain time limit (59 minutes) before shutting down.

The next drainSignal will be ignored, but the third will forcefully shut down
the server by panicking. This is to resolve an issue with Kubernetes where
traffic could be lost during upgrades. See CC-5298 for more details.

Release notes: None

82426: ui: new charts to statement details page r=maryliag a=maryliag

The overview for a statement details page now
shows charts instead of just the mean value for:

- Execution and Planning Time
- Rows Processed
- Execution Retries
- Execution Count
- Contention

https://www.loom.com/share/2b6d0311e61a433d81ad37b8b62fba7d

<img width="1566" alt="Screen Shot 2022-06-06 at 6 14 41 PM" src="https://user-images.githubusercontent.com/1017486/172258188-d0367377-140e-4863-8ecf-f06f14a0e277.png">
<img width="1576" alt="Screen Shot 2022-06-03 at 4 53 49 PM" src="https://user-images.githubusercontent.com/1017486/171951030-0535484d-f4f2-4af4-aa50-f53247951f8c.png">
<img width="1575" alt="Screen Shot 2022-06-03 at 4 54 02 PM" src="https://user-images.githubusercontent.com/1017486/171951037-af6bf17d-0128-49dc-861a-250adaa0f383.png">


Fixes #74517

Release note (ui change): Statement Details page now shows charts
for: Execution and Planning Time, Rows Processed, Execution Retries,
Execution Count and Contention.

82561: sql: put connExecutor's AutoRetry fields into txnState's mutex r=ajwerner a=RichardJCai

Auto retry variables could be used outside the connExecutor's goroutine in calls to
serialize. If this is the case, the field should be in txnState's mutex struct.

Release note: None

Fixes #82506
Fixes #78178

82658: storage: stop writing SST table properties r=jbowens a=erikgrinaker

**util/hlc: add `Timestamp.WallNext()`**

Release note: None

**storage: stop writing SST table properties**

Previously, SSTables were written with the table properties
`crdb.ts.min` and `crdb.ts.max` tracking the minimum and maximum
timestamps in the SSTable. This was used to filter SSTables via the
`MVCCIterator` options `MinTimestampHint` and `MaxTimestampHint`,
aka time-bound iterators.

In 22.1, we additionally began writing the block property
`MVCCTimeInterval` containing the minimum and maximum timestamps for
each SSTable block, aggregated up to the SSTable. Time-bound iterators
filter on this block property in addition to the table properties.

In 22.2, we will be introducing Pebble range keys. Table properties do
not support range keys, so we must migrate entirely to block properties.

This patch therefore stops writing table properties entirely. This
ensures that SSTables containing range keys (introduced in 22.2) will
not have table properties -- conversely, SSTables that do have table
properties can only contain point keys. By doing so, we avoid
erroneously filtering out SSTables with range keys that satisfy the time
predicate but were not taken into account by table properties. A later
patch will add block property collectors for range keys.

We still respect table properties while filtering, to maintain
compatibility with SSTables written by 21.2 nodes and earlier. We must
retain this backwards compatibility as long as it is possible for a
cluster to have SSTables written by older nodes, which can happen e.g.
if a cluster is upgraded from 21.2 to 22.2 in rapid succession.

The change from table to block properties slightly changes the filtering
semantics, as block properties discard the logical timestamp component.
This should not be an issue, since timestamp hints are inclusive, and
TBIs are allowed to emit false positives.

This patch also removes the `IteratorStats.TimeBoundNumSSTs` field, as
block property filters do not update it.

Touches #82596.

Release note: None
  
**storage: remove `pebbleDeleteRangeCollector`**

This was a noop property collector no longer in use.

Release note: None

82664: roachtest: skip flaky costfuzz r=meridional a=meridional

Github issue: #81717

Release note: None

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: PJ Tatlow <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Ye Ji <[email protected]>
@craig craig bot closed this as completed in 9dfdf1b Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants