opt: reduce statistics allocations for avg size #86460
Conversation
Every commit except for the last is from #86452. Here are some benchmarks showing the improvement for the last commit compared to the tip of #86452:
Smart! This makes a lot of sense.
Reviewed 1 of 85 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 646 at r4 (raw file):
if stat.ColumnCount() == 1 && stat.AvgSize() != 0 {
	if stats.AvgColSizes == nil {
		stats.AvgColSizes = make(map[opt.ColumnID]uint64)
Is there an initial capacity that would make sense? Maybe tab.ColumnCount()? Or instead of a map, would a slice of length tab.ColumnCount() work?
pkg/sql/opt/memo/statistics_builder.go
line 648 at r4 (raw file):
	stats.AvgColSizes = make(map[opt.ColumnID]uint64)
}
stats.AvgColSizes[cols.SingleColumn()] = stat.AvgSize()
Could we add the all-NULL logic here?
pkg/sql/opt/props/statistics.go
line 145 at r4 (raw file):
fmt.Fprintf(&buf, ", distinct%s=%.6g", col.Cols.String(), col.DistinctCount)
fmt.Fprintf(&buf, ", null%s=%.6g", col.Cols.String(), col.NullCount)
fmt.Fprintf(&buf, ", avgsize%s=%.6g", col.Cols.String(), col.AvgSize)
If s.AvgColSizes != nil and there's an entry for this column, could we still print an avgsize? It would be useful to notice if there are any gargantuan columns in a scan.
pkg/sql/opt/xform/coster.go
line 1391 at r4 (raw file):
}
colSet := opt.MakeColSet(colID)
colStat, ok := c.mem.RequestColStatTable(tabID, colSet)
Was this the only use of RequestColStatTable? Should we delete it?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 648 at r4 (raw file):
Actually, I'm not quite sure about this:
The only downside to this change is that we no longer set a column's average size to zero if it has all NULL values, according to statistics.
If the column only has NULLs, then system.table_statistics.avgSize should be zero, as should stat.AvgSize(). So it seems like this logic is still in place. Was there another reason you wrote this?
While we're here, though, we could fix an existing issue with average size calculation. The value in system.table_statistics.avgSize and returned by stat.AvgSize() is only for non-NULL values. We should probably weight it by the number of non-NULL rows, so that it is an average over all rows, instead of only being an average over non-NULL rows. I.e. it should be something like stats.AvgColSizes[cols.SingleColumn()] = stat.AvgSize() * (stat.RowCount() - stat.NullCount()) / stat.RowCount().
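As a rough illustration of the arithmetic being proposed (the names below are stand-ins, not the actual statistics_builder code), the weighting could look something like this:

```go
package main

import "fmt"

// nullWeightedAvgSize spreads the average size of the non-NULL values over
// all rows: avgSize * (rowCount - nullCount) / rowCount. Illustrative helper
// only; the real change would live in statistics_builder.go.
func nullWeightedAvgSize(avgSize, rowCount, nullCount uint64) uint64 {
	if rowCount == 0 {
		return 0
	}
	return avgSize * (rowCount - nullCount) / rowCount
}

func main() {
	// Non-NULL values average 100 bytes, but 75 of 100 rows are NULL, so the
	// per-row average drops to 25 bytes.
	fmt.Println(nullWeightedAvgSize(100, 100, 75)) // 25
}
```

Integer division truncates here; whatever the real types end up being, the intent is simply to scale the non-NULL average by the non-NULL fraction of rows.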
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 646 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Is there an initial capacity that would make sense? Maybe tab.ColumnCount()? Or instead of a map, would a slice of length tab.ColumnCount() work?
Great catch! If we use column ordinals instead of column IDs we can use a slice! I'll try that out.
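For context, a minimal sketch of the ordinal-keyed variant being discussed, with invented type and method names rather than the real memo/props code:

```go
// tableStats is a stand-in for the real statistics struct. avgColSizes is
// indexed by column ordinal; a zero entry means "no recorded average size".
type tableStats struct {
	avgColSizes []uint64
}

// setAvgSize lazily allocates one slice sized to the table's column count
// (a single allocation, versus incremental map growth) and records the
// average size for one column ordinal.
func (ts *tableStats) setAvgSize(ord, numCols int, avgSize uint64) {
	if ts.avgColSizes == nil {
		ts.avgColSizes = make([]uint64, numCols)
	}
	ts.avgColSizes[ord] = avgSize
}
```

The appeal over a map keyed by opt.ColumnID is that the length is known up front (tab.ColumnCount()) and lookups become plain indexing with no hashing.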
pkg/sql/opt/memo/statistics_builder.go
line 648 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Actually, I'm not quite sure about this:
The only downside to this change is that we no longer set a column's average size to zero if it has all NULL values, according to statistics.
If the column only has NULLs, then system.table_statistics.avgSize should be zero, as should stat.AvgSize(). So it seems like this logic is still in place. Was there another reason you wrote this?
While we're here, though, we could fix an existing issue with average size calculation. The value in system.table_statistics.avgSize and returned by stat.AvgSize() is only for non-NULL values. We should probably weight it by the number of non-NULL rows, so that it is an average over all rows, instead of only being an average over non-NULL rows. I.e. it should be something like stats.AvgColSizes[cols.SingleColumn()] = stat.AvgSize() * (stat.RowCount() - stat.NullCount()) / stat.RowCount().
Heh, I misunderstood this comment. Now I see it was referring to this logic which does something different than what I thought - it sets the avg size to the default size if there are known to be some non-null values of the column.
I think I can mimic this logic based on your suggestion.
Force-pushed from 16b38bb to 5e54d75.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 646 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Great catch! If we use column ordinals instead of column IDs we can use a slice! I'll try that out.
Done.
pkg/sql/opt/memo/statistics_builder.go
line 648 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Heh, I misunderstood this comment. Now I see it was referring to this logic which does something different than what I thought - it sets the avg size to the default size if there are known to be some non-null values of the column.
I think I can mimic this logic based on your suggestion.
I actually don't think anything needs to change - if AvgSize is zero, it won't be added to the AvgColSizes slice (or its entry will remain zero), and colAvgSize will return the default.
I created a separate issue to track weighting the avg size by non-null columns. #86567
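A tiny sketch of the zero-means-default behavior described above; defaultColSize is an invented constant and the real colAvgSize certainly has more to it:

```go
const defaultColSize = 4 // hypothetical default width in bytes

// colAvgSize returns the recorded average size for a column ordinal, falling
// back to the default when nothing non-zero was recorded - which covers both
// "no statistics" and the all-NULL case where stat.AvgSize() is zero.
func colAvgSize(avgColSizes []uint64, ord int) uint64 {
	if ord < len(avgColSizes) && avgColSizes[ord] != 0 {
		return avgColSizes[ord]
	}
	return defaultColSize
}
```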
pkg/sql/opt/props/statistics.go
line 145 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
If s.AvgColSizes != nil and there's an entry for this column, could we still print an avgsize? It would be useful to notice if there are any gargantuan columns in a scan.
AvgColSizes will only be set for statistics built in makeTableStatistics. These stats are never printed in optimizer plan trees - the stats of relation expressions are printed.
pkg/sql/opt/xform/coster.go
line 1391 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Was this the only use of RequestColStatTable? Should we delete it?
Good catch. Done.
Force-pushed from 5e54d75 to 38cc71e.
Looks like we're getting similar improvements with the slice instead of the map 👍 👍
Reviewed 73 of 85 files at r4, 20 of 45 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/opt/memo/statistics_builder.go
line 648 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
I actually don't think anything needs to change - if AvgSize is zero, it won't be added to the AvgColSizes slice (or its entry will remain zero), and colAvgSize will return the default.
I created a separate issue to track weighting the avg size by non-null columns. #86567
Cool!
pkg/sql/opt/props/statistics.go
line 145 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
AvgColSizes will only be set for statistics built in makeTableStatistics. These stats are never printed in optimizer plan trees - the stats of relation expressions are printed.
Ah, ok.
TFTR! bors r+
Build succeeded.
Prior to the commit, a column's average size in bytes was included in
column statistics. To fetch this average size, the coster requested an
individual column statistic for each scanned column. For scans and joins
involving many columns, this caused many allocations of column
statistics and column sets.
Because we only use a column's average size when costing scans and
lookup joins, there was no need to include it in column statistics.
Average size doesn't propagate up an expression tree like other
statistics do.
This commit removes average size from column statistics and instead
builds a map in props.Statistics that maps column IDs to average size.
This significantly reduces allocations in some cases.
The only downside to this change is that we no longer set a column's
average size to zero if it has all NULL values, according to statistics.
I believe this is a pretty rare edge case that is unlikely to
significantly affect query plans, so I think the trade-off is worth it.
Fixes #80186
Release justification: This is a minor change that improves optimizer
performance.
Release note: None
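To make the shape of the change concrete, here is a condensed, hypothetical sketch of the coster side; the names mirror the discussion above but are not the actual coster.go code:

```go
// Before (per the commit message): each costed column required requesting a
// full column statistic, allocating a ColumnStatistic and a ColSet, e.g.
//
//	colStat, ok := c.mem.RequestColStatTable(tabID, opt.MakeColSet(colID))
//
// After: average sizes are precomputed once per table, and costing a scan or
// lookup join only sums plain lookups with a zero-means-default fallback.
func avgRowSize(avgColSizes []uint64, ordinals []int, defaultSize uint64) uint64 {
	var size uint64
	for _, ord := range ordinals {
		if ord < len(avgColSizes) && avgColSizes[ord] != 0 {
			size += avgColSizes[ord]
		} else {
			size += defaultSize
		}
	}
	return size
}
```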