-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #53963
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the encoding logic and math LGTM, but i more or less skimmed the opt / rowexec bits.
one question about extending this to support knn.
Reviewed 26 of 26 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @sumeerbhola)
pkg/geo/geoindex/s2_geometry_index.go, line 197 at r1 (raw file):
bboxRef := g.BoundingBoxRef() if bboxRef == nil && len(keys) > 0 { return keys, bbox, errors.Errorf("non-empty geometry should have bounding box")
nit: maybe AssertionFailedf
(as with others errors.Error in this PR)?
pkg/sql/opt/invertedexpr/geo_expression.go, line 68 at r1 (raw file):
} // Avoid per-span heap allocations. b := make([]byte, 0, len(ukSpans)*(2*encoding.MaxVarintLen+2))
noob question: why the +2
pkg/util/encoding/encoding.go, line 865 at r1 (raw file):
encodeTwoFloats := loX == hiX && loY == hiY if encodeTwoFloats { b = append(b, byte(geoInvertedTwoFloats))
side question: do you think this will help knn in any way? will this help/hinder sorting?
0c02c57
to
9a55c81
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)
pkg/geo/geoindex/s2_geometry_index.go, line 197 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
nit: maybe
AssertionFailedf
(as with others errors.Error in this PR)?
Done
pkg/sql/opt/invertedexpr/geo_expression.go, line 68 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
noob question: why the +2
It is the 1 byte marker x 2 (two keys in a span). I've added a comment.
pkg/util/encoding/encoding.go, line 865 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
side question: do you think this will help knn in any way? will this help/hinder sorting?
The sort ordering of all the rows in the index that share the same cellid is not going to be helpful, IMO. If we are reading a particular cellid, because the search has expanded to that cellid, we will want to read all the rows for that cellid.
pkg/util/encoding/encoding.go, line 1572 at r1 (raw file):
ArrayKeyDesc Type = 23 // Array key encoded descendingly Box2D Type = 24 geoInverted Type = 25 // Not exported since only used for pretty-printing
I've undone this pretty printing change since it is hard to make it work: the code also pretty prints cases where we have a span for reading, which only has the cellid. And I can't hack it to just read the uvarint if there is nothing left in the []byte
because there is a fallback path in PrettyPrintValue
when it isn't a uvarint that tries to undo the effect of PrefixEnd
.
3819ab4
to
b6ff8a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Just some nits below...
Reviewed 22 of 26 files at r1, 11 of 11 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @otan and @sumeerbhola)
pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):
// index that has a true value. PreFilter must only be called when CanPreFilter() // is true. This batching of pre-filter application for a looked up row allows // enc to be decoded once.
This is all a bit abstract -- could you make it more concrete with an example?
pkg/sql/opt/invertedidx/geo.go, line 713 at r2 (raw file):
} srid = g.SRID() }
probably still worth adding a default
case with a panic just to be safe...
pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):
// PreFilter pre-filters a retrieved inverted values against a set // of pre-filter states. func (p *PreFilterer) PreFilter(
What does the state returned in result
mean v. the single boolean return value?
pkg/sql/opt/invertedidx/geo.go, line 745 at r2 (raw file):
} rv := false for i := range preFilters {
why would we have multiple preFilters
?
pkg/sql/opt/invertedidx/geo_test.go, line 525 at r2 (raw file):
shapes: geoShapes, expected: [][]bool{ {true, false, false, true}, {false, true, false, true}, {false, false, true, false},
[nit] I think this would be easier to read if the four rows were stacked on top of each other like a matrix
Also, would be good to add a comment somewhere explaining what this is testing. It seems like you're comparing each of the four shapes above with each other, but making that explicit would help.
Are there any tests here where the preFilter returns true but the final result would be false? Might help to note that with a comment.
pkg/sql/rowexec/inverted_expr_evaluator.go, line 373 at r2 (raw file):
append(nextSpan.exprAndSetIndexList, pendingSpans[i].exprAndSetIndexList...) } sort.Sort(exprAndSetIndexSorter(nextSpan.exprAndSetIndexList))
[nit] add a comment to explain what this block of code is doing
pkg/sql/rowexec/inverted_expr_evaluator.go, line 482 at r2 (raw file):
// TODO(sumeer): if this will be called in non-decreasing order of enc, // use that to optimize the binary search. func (b *batchedInvertedExprEvaluator) prepareAddIndexRow(
[nit] add a comment to explain what is happening here in prepareAddIndexRow
v. addIndexRow
. Also explain what the return value means.
b6ff8a7
to
0d83c18
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
I've addressed the comments.
I need to get to the bottom of the failure in TestStats/inverted-geo
caused by a plan change -- it has chosen to use the invertedFilterer. https://teamcity.cockroachdb.com/viewLog.html?buildId=2249375&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)
pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
This is all a bit abstract -- could you make it more concrete with an example?
Done.
pkg/sql/opt/invertedidx/geo.go, line 713 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
probably still worth adding a
default
case with a panic just to be safe...
Done.
pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
What does the state returned in
result
mean v. the single boolean return value?
Added a comment. It represents whether there is at least one result
with a true value.
pkg/sql/opt/invertedidx/geo.go, line 745 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
why would we have multiple
preFilters
?
Each of the left-side in a batch has a pre-filter extracted for it (the bounding box and SRID).
There is a comment where DatumsToInvertedExpr
is declared that motivates this: "This batching of pre-filter application for a looked up row allows enc to be decoded once."
pkg/sql/opt/invertedidx/geo_test.go, line 525 at r2 (raw file):
[nit] I think this would be easier to read if the four rows were stacked on top of each other like a matrix
Done
Also, would be good to add a comment somewhere explaining what this is testing. It seems like you're comparing each of the four shapes above with each other, but making that explicit would help.
Added a comment
Are there any tests here where the preFilter returns true but the final result would be false? Might help to note that with a comment.
I've added a test where all the results are false, so the return value would be false.
pkg/sql/rowexec/inverted_expr_evaluator.go, line 373 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
[nit] add a comment to explain what this block of code is doing
Done
pkg/sql/rowexec/inverted_expr_evaluator.go, line 482 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
[nit] add a comment to explain what is happening here in
prepareAddIndexRow
v.addIndexRow
. Also explain what the return value means.
Done. Also removed the unused parameter from addIndexRow
58cd34f
to
46635a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failure in TestStats/inverted-geo caused by a plan change -- it has chosen to use the invertedFilterer
it is fixed now.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r3, 5 of 5 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @sumeerbhola)
pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):
Previously, sumeerbhola wrote…
Done.
It might also help to give some hint about what the interface{}
objects contain.
pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):
Previously, sumeerbhola wrote…
Added a comment. It represents whether there is at least one
result
with a true value.
Ah, didn't realize this was the interface implementation. I think it's more common to have a comment like // PreFilter is part of the DatumsToInvertedExpr interface.
, and put all the relevant info on the interface method.
pkg/sql/opt/invertedidx/geo.go, line 714 at r3 (raw file):
srid = g.SRID() default: panic(errors.Errorf("datum of unhandled type: ", d))
I think all these panic errors should be AssertionFailedf
46635a3
to
b30d0da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)
pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
It might also help to give some hint about what the
interface{}
objects contain.
Done
pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Ah, didn't realize this was the interface implementation. I think it's more common to have a comment like
// PreFilter is part of the DatumsToInvertedExpr interface.
, and put all the relevant info on the interface method.
The same thing is stated in DatumsToInvertedExpr
. I've updated the comment to state that PreFilterer
does not implement the full interface (which is why the comment was not saying "PerFilter is part of the ..." -- I thought it may mislead the reader).
pkg/sql/opt/invertedidx/geo.go, line 714 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I think all these panic errors should be AssertionFailedf
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @sumeerbhola)
pkg/sql/opt/invertedidx/geo.go, line 727 at r5 (raw file):
// PreFilter pre-filters a retrieved inverted value against a set of // pre-filter states. The function signature matches the PreFilter function of // the DatumsToInvertedExpr interface (PreFilterer does not implment the full
implment -> implement
…ed column The bounding box is encoded after the cellid, and used to perform "pre-filtering", i.e., filtering after the inverted row is fetched but before it is added to various set expressions. The fundamental weakness with the cell covering approach is that the coverings for convex shapes are typically worse than a bounding box and we pay significant cost in doing a lookup join to eliminate the false positives. This attempts to reduce those false positives for the lookup join, and part of the inverted join (the invertedJoiner can avoid adding a row to the RowContainer). For the canonical nyc st_intersects and st_dwithin queries the joinReader would see a much higher input row count than the output due to false positives. Specifically, - st_intersects: input 3110, output 490 - st_dwithin: input 23183, output 8606 With this change the input counts are 1064 and 11177. There is the additional cost of reading extra bytes and applying the bounding box filter, but it is more than offset by the gains (see comparison numbers below). This is a format change for the index, so is easier to do before the launch of spatial features. The encoding here is trivial, and consumes 17 bytes for points and 33 bytes for others. An experiment with a more complicated encoding that converted to integers while bounding the relative error did not show any improvement wrt the following benchmarks. There are TODOs for followup PRs: - Do such pre-filtering for invertedFilterer. This will need a spec change. - Support pre-filtering for expressions containing more than one function. Comparison numbers: old is master name old ms new ms delta Test/nyc/nyc-intersects 80.0 ±23% 69.0 ±30% -13.76% (p=0.000 n=49+48) Test/nyc/nyc-dwithin 225 ±13% 169 ± 8% -24.88% (p=0.000 n=46+45) Test/nyc/nyc-coveredby 84.1 ±11% 69.0 ±14% -17.96% (p=0.000 n=50+49) Test/nyc/nyc-expand-blocks 85.7 ± 6% 84.6 ± 8% -1.22% (p=0.030 n=45+47) Test/nyc/nyc-dwithin-and-dfullywithin 195 ± 7% 195 ± 6% ~ (p=1.000 n=46+49) Test/postgis_geometry_tutorial/11.2b 1.66 ±24% 1.53 ±45% -7.87% (p=0.006 n=50+50) Test/postgis_geometry_tutorial/11.6 1.61 ±19% 1.66 ±32% ~ (p=0.329 n=46+50) Test/postgis_geometry_tutorial/12.1.2 0.89 ±26% 0.91 ±26% ~ (p=0.532 n=46+50) Test/postgis_geometry_tutorial/12.2.3 1.33 ±20% 1.38 ±36% ~ (p=0.308 n=49+49) Test/postgis_geometry_tutorial/12.2.4 1.30 ±27% 1.29 ±18% ~ (p=0.746 n=49+45) Test/postgis_geometry_tutorial/13.0 1.49 ±36% 1.55 ±33% ~ (p=0.141 n=49+50) Test/postgis_geometry_tutorial/13.1a 233 ± 7% 197 ± 7% -15.45% (p=0.000 n=48+47) Test/postgis_geometry_tutorial/13.1c 24.9 ±15% 19.4 ±18% -22.17% (p=0.000 n=50+48) Test/postgis_geometry_tutorial/13.2 309 ± 4% 257 ± 7% -16.92% (p=0.000 n=43+47) Test/postgis_geometry_tutorial/14.1a 1.51 ±15% 1.70 ±16% +12.93% (p=0.000 n=48+44) Test/postgis_geometry_tutorial/14.2b 6.28 ±22% 6.07 ±19% ~ (p=0.050 n=50+45) Test/postgis_geometry_tutorial/14.2c 3.70 ±14% 3.47 ±10% -5.98% (p=0.000 n=48+46) Test/postgis_geometry_tutorial/14.3c 31.9 ±11% 25.9 ±12% -18.60% (p=0.000 n=48+49) Test/postgis_geometry_tutorial/15.0 1.50 ±20% 1.50 ±20% ~ (p=0.860 n=48+49) Release justification: This is an incompatible format change, and discussions with engineering and product indicated a preference to do this now before we launch spatial features Release note: None
b30d0da
to
dd8b81c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)
pkg/sql/opt/invertedidx/geo.go, line 727 at r5 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
implment -> implement
Done.
bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
…ed column
The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.
The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).
For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,
With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).
This is a format change for the index, so is easier to do
before the launch of spatial features.
The encoding here is trivial, and consumes 17 bytes for points
and 33 bytes for others. An experiment with a more complicated
encoding that converted to integers while bounding the
relative error did not show any improvement wrt the following
benchmarks.
There are TODOs for followup PRs:
a spec change.
one function.
Comparison numbers:
old is master
Release justification: This is an incompatible format change,
and discussions with engineering and product indicated a
preference to do this now before we launch spatial features
Release note: None