
stats: truncate large datums when sampling for histogram #39418

Merged: 1 commit into cockroachdb:master on Aug 8, 2019

Conversation

@rytaft (Collaborator) commented on Aug 7, 2019

This commit adds logic to truncate long bit arrays, byte strings,
strings, and collated strings during sampling for histogram creation.
We do this to avoid using excessive memory or disk space during
sampling and storage of the final histogram.

Release note: None
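
For illustration, a minimal self-contained Go sketch of the idea described above (hypothetical names, not the code merged in this PR): truncate a sampled string to at most maxBytes bytes, cut on a UTF-8 rune boundary, and copy the result so the large original can be garbage collected.

package main

import "fmt"

// truncateString returns a prefix of s that is at most maxBytes bytes long,
// cut on a rune boundary so the result stays valid UTF-8, and copied so the
// potentially huge original string can be garbage collected.
func truncateString(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	last := 0
	// Ranging over a string advances rune by rune; i is the byte index at
	// which the current rune starts.
	for i := range s {
		if i > maxBytes {
			break
		}
		last = i
	}
	// Convert through []byte to force a copy that does not share the
	// original string's backing array.
	return string([]byte(s[:last]))
}

func main() {
	long := "héllo wörld, imagine this is a very large sampled value"
	fmt.Println(truncateString(long, 10))
}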

@rytaft requested review from @justinj and @RaduBerinde on August 7, 2019
@rytaft requested a review from a team as a code owner on August 7, 2019
@cockroach-teamcity (Member) commented:

This change is Reviewable

@RaduBerinde (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @justinj, @RaduBerinde, and @rytaft)


pkg/sql/stats/row_sampling.go, line 153 at r1 (raw file):

// truncateDatum truncates large datums to avoid using excessive memory or disk
// space.

[nit] Explain what "truncate" means. Is it the closest datum of at most that size?


pkg/sql/stats/row_sampling.go, line 155 at r1 (raw file):

// space.
func truncateDatum(evalCtx *tree.EvalContext, d tree.Datum, maxBytes int) tree.Datum {
	if d.Size() <= uintptr(maxBytes) {

If we move this check outside of the call, we can avoid calling Size() twice in the common case (where we don't truncate).
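
A self-contained sketch of the call-site pattern being suggested (sizedDatum and truncate are hypothetical stand-ins, not the PR's types): the caller checks Size() once, so the common no-truncation path never pays for a second Size() call inside the helper.

package main

import "fmt"

type sizedDatum struct{ s string }

func (d sizedDatum) Size() uintptr { return uintptr(len(d.s)) }

// truncate assumes the caller has already verified d.Size() > maxBytes.
func truncate(d sizedDatum, maxBytes int) sizedDatum {
	return sizedDatum{s: d.s[:maxBytes]}
}

func main() {
	const maxBytes = 8
	d := sizedDatum{s: "a short value"}
	if d.Size() > uintptr(maxBytes) { // the only Size() call on this path
		d = truncate(d, maxBytes)
	}
	fmt.Println(d.s)
}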


pkg/sql/stats/row_sampling.go, line 171 at r1 (raw file):

	case *tree.DString:
		var r rune
		maxLen := uintptr(maxBytes) / unsafe.Sizeof(r)

rune is 32 bits, so this makes the limit 4 times smaller than we want in the common case with no Unicode characters.
Also, we shouldn't need to copy since strings are immutable and can be sliced. I would do something like:

last := 0
// For strings, range skips from rune to rune and i is the byte index of the current rune.
for i := range s {
  if i > maxBytes {
    break
  }
  last = i
}
return tree.NewDString(string((*t)[:last]))

I think we can also slice and avoid the copy in the DBytes case
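
For reference, a tiny runnable demonstration of the range-over-string behavior the snippet above relies on (illustrative only): range advances rune by rune and yields the byte index at which each rune starts.

package main

import "fmt"

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8
	for i, r := range s {
		fmt.Printf("rune %q starts at byte %d\n", r, i)
	}
	// Prints byte indices 0, 1, 3, 4, 5: i advances by each rune's byte
	// width, so breaking once i exceeds maxBytes truncates on a rune boundary.
}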

@rytaft (Collaborator, Author) left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @justinj and @RaduBerinde)


pkg/sql/stats/row_sampling.go, line 153 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] Explain what "truncate" means. Is it the closest datum of at most that size?

Done.


pkg/sql/stats/row_sampling.go, line 155 at r1 (raw file):

Previously, RaduBerinde wrote…

If we move this check outside of the call, we can avoid calling Size() twice in the common case (where we don't truncate).

Good idea - done.


pkg/sql/stats/row_sampling.go, line 171 at r1 (raw file):

Previously, RaduBerinde wrote…

rune is 32 bits, so this makes the limit 4 times smaller than we want in the common case with no Unicode characters.
Also, we shouldn't need to copy since strings are immutable and can be sliced. I would do something like:

last := 0
// For strings, range skips from rune to rune and i is the byte index of the current rune.
for i := range s {
  if i > maxBytes {
    break
  }
  last = i
}
return tree.NewDString(string((*t)[:last]))

I think we can also slice and avoid the copy in the DBytes case

Nice! I didn't know range worked that way with strings.

I think we need to do the copy because otherwise the backing array of the long string/byte slice won't be garbage collected.
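
A small runnable illustration of why the copy matters (not from this PR): a sub-slice keeps the entire backing array reachable, while an explicit copy does not.

package main

import "fmt"

func main() {
	big := make([]byte, 1<<20) // stand-in for a very large sampled value

	// Slicing shares big's backing array, so the full 1 MiB stays reachable
	// for as long as shared is referenced.
	shared := big[:10]

	// Copying into a fresh slice lets the original array be collected once
	// big itself goes out of scope.
	copied := make([]byte, 10)
	copy(copied, big[:10])

	fmt.Println(cap(shared), cap(copied)) // 1048576 10
}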

@RaduBerinde (Member) left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @justinj)


pkg/sql/stats/row_sampling.go, line 153 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Done.

Sorry to be nitpicky but this doesn't sound right. What we return is not a valid representation of d. It's a different value and we need to describe something about that value (that makes sense in general, not just strings). Maybe say it returns a datum that is "close" (best-effort) to the original datum. Not sure how to define "close" but maybe it's ok to be vague there


pkg/sql/stats/row_sampling.go, line 171 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Nice! I didn't know range worked that way with strings.

I think we need to do the copy because otherwise the backing array of the long string/byte slice won't be garbage collected.

Good point, thanks for adding the comments.

@rytaft (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @justinj and @RaduBerinde)


pkg/sql/stats/row_sampling.go, line 153 at r1 (raw file):

Previously, RaduBerinde wrote…

Sorry to be nitpicky but this doesn't sound right. What we return is not a valid representation of d. It's a different value and we need to describe something about that value (that makes sense in general, not just strings). Maybe say it returns a datum that is "close" (best-effort) to the original datum. Not sure how to define "close" but maybe it's ok to be vague there

Does this sound better now?

@RaduBerinde (Member) left a comment


Reviewed 2 of 4 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @justinj and @rytaft)


pkg/sql/stats/row_sampling.go, line 153 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Does this sound better now?

Yeah, thanks!

@rytaft (Collaborator, Author) left a comment


Thanks!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @justinj and @RaduBerinde)

craig bot pushed a commit that referenced this pull request Aug 8, 2019
39418: stats: truncate large datums when sampling for histogram r=rytaft a=rytaft

This commit adds logic to truncate long bit arrays, byte strings,
strings, and collated strings during sampling for histogram creation.
We do this to avoid using excessive memory or disk space during
sampling and storage of the final histogram.

Release note: None

39424: importccl: Direct-ingest uses two bulk adders instead of one. r=adityamaru27 a=adityamaru27

This is another change to stabilize direct ingest import before
it is made the default.
As a consequence of #39271, the number of files (L0 and total) and the
cumulative compaction size increased drastically.
A consequence of no longer creating buckets of TableIDIndexID
before flushing is that the single bulk adder would receive a
mix of primary and secondary index entries. Since SSTs cannot
span across the splits we inserted between index spans, it would
create numerous, small secondary index SSTs along with the
bigger primary index SSTs, and flush on reaching its limit
(which would be often).

By introducing two adders, one for ingesting primary index data
and the other for ingesting secondary index data, we regain the
ability to make fewer, bigger secondary index SSTs and flush less
often. The peak memory usage is lower than what prebuffering used to
hit, while the number of files (L0 and total) and the cumulative
compaction size return to prebuffering levels.

Some stats below for a tpcc 1k import on a 1-node cluster.

With prebuffering:
Total Files: 7670
L0 Files: 1848
Cumulative Compaction: 24.54 GiB

Without prebuffering, one adder:
Total Files: 22420
L0 Files: 16900
Cumulative Compaction: 52.43 GiB

Without prebuffering, two adders:
Total Files: 6805
L0 Files: 1078
Cumulative Compaction: 18.89 GiB

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
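
As a purely illustrative sketch of the two-adder routing described in the 39424 commit message above (the entry and adder types here are hypothetical stand-ins, not CockroachDB's bulk adder API): primary and secondary index entries go to separate buffers, so each buffer stays homogeneous and the secondary index data accumulates into fewer, larger batches instead of forcing frequent flushes of a single mixed buffer.

package main

import "fmt"

type entry struct {
	key, value string
	primary    bool
}

type adder struct {
	name string
	buf  []entry
}

func (a *adder) add(e entry) { a.buf = append(a.buf, e) }

func (a *adder) flush() {
	fmt.Printf("%s adder flushing %d entries\n", a.name, len(a.buf))
	a.buf = a.buf[:0]
}

func main() {
	primary := &adder{name: "primary"}
	secondary := &adder{name: "secondary"}

	rows := []entry{
		{key: "/Table/53/1/1", value: "row", primary: true},
		{key: "/Table/53/2/abc", value: "idx", primary: false},
		{key: "/Table/53/1/2", value: "row", primary: true},
	}
	for _, e := range rows {
		if e.primary {
			primary.add(e)
		} else {
			secondary.add(e)
		}
	}

	// In practice each adder flushes independently when it hits its own
	// size limit; here we simply flush both at the end.
	primary.flush()
	secondary.flush()
}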
@craig (craig bot, Contributor) commented on Aug 8, 2019

Build succeeded

@craig (craig bot) merged commit 0c524f1 into cockroachdb:master on Aug 8, 2019
@rytaft deleted the truncate-datum branch on April 2, 2020