Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dnm] kv: expose (and use) byte batch response size limit #44341

Closed
wants to merge 2 commits into from

Conversation

tbg
Copy link
Member

@tbg tbg commented Jan 24, 2020

Not much to see here in terms of code, but this is a good venue to ask the question: any reason to not be doing this?

First commit is #44339


This is a (very rough) prototype for allowing consumers of the KV API to
restrict the size of result sets returned from spans. It does so
by simply plumbing through a byte hint from the kvfetcher to the
ultimate MVCC scan.

I mostly wanted to see that this all works, and it does. In the below,
the kv table contains ~10k 1mb rows:

This PR:

root@:26257/defaultdb> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 39.029189s

Before:

root@:26257/defaultdb ?> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 56.87552s

12:58 is this PR, ~13:00 is master.

image

Finished up, this work should already de-flake #33660.

Release note: None

tbg added 2 commits January 24, 2020 13:26
TODO: due to the RocksDB/pebble modes of operation, there are
currently two ways to do anything. Thanks to the two possible
response types for batches (KVs vs bytes) there is another
factor of two. In short, I expect to have to make three more
similar changes to fully be able to implement byte hints for
scans.
TODO: testing and potentially write the code in a saner way.

----

A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to
bound the size of the response in terms of bytes rather than rows.
This commit provides the necessary functionality to pebbleMVCCScanner:
The new maxBytes field stops the scan once the size of the result
meets or exceeds the threshold (at least one key will be added,
regardless of its size).

The classic example of the problem this addresses is a table in which
each row is, say, ~1mb in size. A full table scan will currently fetch
data from KV in batches of [10k], causing at least 10GB of data held in
memory at any given moment. This sort of thing does happen in practice;
we have a long-failing roachtest cockroachdb#33660 because of just that, and
anecdotally OOMs in production clusters are with regularity caused by
individual queries consuming excessive amounts of memory at the KV
level.

Plumbing this limit into a header field on BatchRequest and down to the
engine level will allow the batching in [10k] to become byte-sized in
nature, thus avoiding one obvious source OOMs. This doesn't solve cockroachdb#19721
in general (many such requests could work together to consume a lot of
memory after all), but it's a sane baby step that might just avoid a
large portion of OOMs already.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
This is a (very rough) prototype for allowing consumers of the KV API to
restrict the size of result sets returned from spans. It does so
by simply plumbing through a byte hint from the kvfetcher to the
ultimate MVCC scan.

I mostly wanted to see that this all works, and it does. In the below,
the `kv` table contains ~10k 1mb rows:

This PR:
```
root@:26257/defaultdb> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 39.029189s
```

Before:
```
root@:26257/defaultdb ?> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 56.87552s
```

Finished up, this work should already de-flake cockroachdb#33660.

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@petermattis
Copy link
Collaborator

Not much to see here in terms of code, but this is a good venue to ask the question: any reason to not be doing this?

I can't think of anything. Seems like pure goodness in terms of both speed and memory usage.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad this was as straightforward as we had though. The approach looks good and the results in your experiment are very promising. Have you gotten a chance to run this with hotspotsplits/nodes=4?

Reviewed 7 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)


pkg/kv/dist_sender.go, line 1226 at r2 (raw file):

			}

			mightStopEarly := ba.MaxSpanRequestKeys > 0 || ba.MaxSpanResponseBytes > 0

nit: can this be pulled up above canParallelize and only computed once? I think it can also be set as the initial value of canParallelize.


pkg/sql/row/kv_batch_fetcher.go, line 229 at r2 (raw file):

	var ba roachpb.BatchRequest
	ba.Header.MaxSpanRequestKeys = f.getBatchSize()
	ba.Header.MaxSpanResponseBytes = 1 << 20 // 1MB

I'm assuming this isn't going to stay constant like this. If not, what were you envisioning as the policy to use for setting the byte limit on these requests?

  1. pull out a constant like we have with kvBatchSize?
  2. pull out a cluster setting?
  3. dynamically grow the response size like we do in getBatchSizeForIdx? (I don't think this makes sense because there's no concept of soft row-size limits)
  4. use the optimizer to make size estimates using histograms and pass that in here?
  5. something else?

pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):

		var kvData [][]byte
		var numKvs int64
		opts := engine.MVCCScanOptions{

Are we not planning on supporting byte limits for the roachpb.KEY_VALUES response format? We probably don't want to get into the business of estimating the additional overhead of the KeyValue protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.


pkg/storage/batcheval/cmd_scan.go, line 53 at r2 (raw file):

		}
		kvData, numKvs, resumeSpan, intents, err = engine.MVCCScanToBytes(
			ctx, reader, args.Key, args.EndKey, cArgs.MaxKeys, h.Timestamp,

This seems to indicate that the max arg belongs in the MVCCScanOptions struct as well.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)


pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are we not planning on supporting byte limits for the roachpb.KEY_VALUES response format? We probably don't want to get into the business of estimating the additional overhead of the KeyValue protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.

I think we should support MaxSpanResponseBytes with the KEY_VALUES format. And agree that we shouldn't try to estimate the additional overhead of the KeyValue protos. Simply passing through the MaxSpanResponseBytes setting to MVCCScan seems sufficient.

@ajwerner
Copy link
Contributor

@tbg is this going to make it into this release? I'm working on paginating ExportRequests right now. I think that'd I would like to utilize this logic or something like it. When it comes to ExportRequest there's two different parameters (1) How big should an individual file be (2) How much total data should be exported.

Given this is coming soon I think I'm going to just do (1) above and leave (2) to adopt this. In the short term I'm going to leave a TODO in CDC backfills to adopt the byte limit code rather than implementing something of my own. Relates to #44482 and the CDC portion of #39717 (comment).

@knz
Copy link
Contributor

knz commented Jan 29, 2020

It is planned for this release yes (whether it makes it will depend on how happy we are with the results)

@tbg
Copy link
Member Author

tbg commented Jan 30, 2020

@ajwerner this is going to be my must have for the coming milestone, so yes (and I'm planning to get it done early, too). Mind linking me to the pagination issue?

@tbg
Copy link
Member Author

tbg commented Jan 30, 2020

Nevermind, found the issue: #43356

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jan 30, 2020
In cockroachdb#44440 we added a `targetSize` parameter to enable pagination of export
requests. In that PR we defined the targetSize to return just before the
key that would lead to the `targetSize` being exceeded. This definition is
unfortunate when thinking about a total size limit for pagination in the
DistSender (which we'll add when cockroachdb#44341 comes in). Imagine a case where
we set a total byte limit of 1MB and a file byte limit of 1MB. That setting
should lead to at most a single file being emitted (assuming one range holds
enough data). If we used the previous definition we'd create a file which is
just below 1MB and then the DistSender would need send another request which
would contain a tiny amount of data. This brings the behavior in line with
the semantics introduced in cockroachdb#44341 for ScanRequests and is just easier to
reason about.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Feb 3, 2020
This commit adopts the API change in cockroachdb#44440 and the previous commit. It adds
a hidden cluster setting to control the target size. There's not a ton of
testing but there's some.

Further work includes:

 * Adding a mechanism to limit the number of files returned from an
   ExportRequest for use in CDC backfills. This is currently blocked
   on cockroachdb#44341.

I'm omitting a release note because the setting is hidden.

Release note: None.
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Feb 3, 2020
This PR is motivated by the desire to get the memory usage of CDC under
control in the presense of much larger ranges. Currently when a changefeed
decides it needs to do a backfill, it breaks the spans up along range
boundaries and then fetches the data (with some parallelism) for the backfill.

The memory overhead was somewhat bounded by the range size. If we want to make
the range size dramatically larger, the memory usage would become a function of
that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more.
Another fascinating revelation of late is that the `ScanResponse` does indeed
include MVCC timestamps (not the we necessarily needed them but it's a good
idea to keep them for compatibility).

The `ScanRequest` permits currently a limit on `NumRows` which this commit
utilized. I wanted to get this change typed in anticipation of cockroachdb#44341 which
will provide a limit on `NumBytes`.

I retained the existing parallelism as ScanRequests with limits are not
parallel.

I would like to do some benchmarking but I feel pretty okay about the
testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Feb 4, 2020
This PR is motivated by the desire to get the memory usage of CDC under
control in the presense of much larger ranges. Currently when a changefeed
decides it needs to do a backfill, it breaks the spans up along range
boundaries and then fetches the data (with some parallelism) for the backfill.

The memory overhead was somewhat bounded by the range size. If we want to make
the range size dramatically larger, the memory usage would become a function of
that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more.
Another fascinating revelation of late is that the `ScanResponse` does indeed
include MVCC timestamps (not the we necessarily needed them but it's a good
idea to keep them for compatibility).

The `ScanRequest` permits currently a limit on `NumRows` which this commit
utilized. I wanted to get this change typed in anticipation of cockroachdb#44341 which
will provide a limit on `NumBytes`.

I retained the existing parallelism as ScanRequests with limits are not
parallel.

I would like to do some benchmarking but I feel pretty okay about the
testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
craig bot pushed a commit that referenced this pull request Feb 4, 2020
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner

This PR is motivated by the desire to get the memory usage of CDC under
control in the presense of much larger ranges. Currently when a changefeed
decides it needs to do a backfill, it breaks the spans up along range
boundaries and then fetches the data (with some parallelism) for the backfill.

The memory overhead was somewhat bounded by the range size. If we want to make
the range size dramatically larger, the memory usage would become a function of
that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more.
Another fascinating revelation of late is that the `ScanResponse` does indeed
include MVCC timestamps (not the we necessarily needed them but it's a good
idea to keep them for compatibility).

The `ScanRequest` permits currently a limit on `NumRows` which this commit
utilized. I wanted to get this change typed in anticipation of #44341 which
will provide a limit on `NumBytes`.

I retained the existing parallelism as ScanRequests with limits are not
parallel.

I would like to do some benchmarking but I feel pretty okay about the
testing we have in place already. @danhhz what do you want to see here?

Relates to #39717.

Release note: None.

44719: sql: add telemetry for uses of alter primary key r=otan a=rohany

Fixes #44716.

This PR adds a telemetry counter for uses
of the alter primary key command.

Release note (sql change): This PR adds collected telemetry
from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis

* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store

Fixes #44631 

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
@jordanlewis
Copy link
Member

Very excited for this!

@tbg tbg closed this Feb 24, 2020
@tbg
Copy link
Member Author

tbg commented Feb 24, 2020

#45319

@tbg tbg deleted the kvfetcher-bytes-limit branch February 24, 2020 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants