Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #53963

Merged
merged 1 commit into from
Sep 8, 2020

Conversation

sumeerbhola
Copy link
Collaborator

@sumeerbhola sumeerbhola commented Sep 4, 2020

…ed column

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,

  • st_intersects: input 3110, output 490
  • st_dwithin: input 23183, output 8606
    With this change the input counts are 1064 and 11177. There
    is the additional cost of reading extra bytes and applying
    the bounding box filter, but it is more than offset by the
    gains (see comparison numbers below).

This is a format change for the index, so is easier to do
before the launch of spatial features.

The encoding here is trivial, and consumes 17 bytes for points
and 33 bytes for others. An experiment with a more complicated
encoding that converted to integers while bounding the
relative error did not show any improvement wrt the following
benchmarks.

There are TODOs for followup PRs:

  • Do such pre-filtering for invertedFilterer. This will need
    a spec change.
  • Support pre-filtering for expressions containing more than
    one function.

Comparison numbers:

old is master

name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: This is an incompatible format change,
and discussions with engineering and product indicated a
preference to do this now before we launch spatial features

Release note: None

@sumeerbhola sumeerbhola requested review from rytaft, otan and a team September 4, 2020 16:11
@sumeerbhola sumeerbhola requested a review from a team as a code owner September 4, 2020 16:11
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the encoding logic and math LGTM, but i more or less skimmed the opt / rowexec bits.
one question about extending this to support knn.

Reviewed 26 of 26 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @sumeerbhola)


pkg/geo/geoindex/s2_geometry_index.go, line 197 at r1 (raw file):

	bboxRef := g.BoundingBoxRef()
	if bboxRef == nil && len(keys) > 0 {
		return keys, bbox, errors.Errorf("non-empty geometry should have bounding box")

nit: maybe AssertionFailedf (as with others errors.Error in this PR)?


pkg/sql/opt/invertedexpr/geo_expression.go, line 68 at r1 (raw file):

	}
	// Avoid per-span heap allocations.
	b := make([]byte, 0, len(ukSpans)*(2*encoding.MaxVarintLen+2))

noob question: why the +2


pkg/util/encoding/encoding.go, line 865 at r1 (raw file):

	encodeTwoFloats := loX == hiX && loY == hiY
	if encodeTwoFloats {
		b = append(b, byte(geoInvertedTwoFloats))

side question: do you think this will help knn in any way? will this help/hinder sorting?

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)


pkg/geo/geoindex/s2_geometry_index.go, line 197 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

nit: maybe AssertionFailedf (as with others errors.Error in this PR)?

Done


pkg/sql/opt/invertedexpr/geo_expression.go, line 68 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

noob question: why the +2

It is the 1 byte marker x 2 (two keys in a span). I've added a comment.


pkg/util/encoding/encoding.go, line 865 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

side question: do you think this will help knn in any way? will this help/hinder sorting?

The sort ordering of all the rows in the index that share the same cellid is not going to be helpful, IMO. If we are reading a particular cellid, because the search has expanded to that cellid, we will want to read all the rows for that cellid.


pkg/util/encoding/encoding.go, line 1572 at r1 (raw file):

	ArrayKeyDesc Type = 23 // Array key encoded descendingly
	Box2D        Type = 24
	geoInverted  Type = 25 // Not exported since only used for pretty-printing

I've undone this pretty printing change since it is hard to make it work: the code also pretty prints cases where we have a span for reading, which only has the cellid. And I can't hack it to just read the uvarint if there is nothing left in the []byte because there is a fallback path in PrettyPrintValue when it isn't a uvarint that tries to undo the effect of PrefixEnd.

@sumeerbhola sumeerbhola force-pushed the inverted_bbox branch 2 times, most recently from 3819ab4 to b6ff8a7 Compare September 4, 2020 21:06
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice work! Just some nits below...

Reviewed 22 of 26 files at r1, 11 of 11 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @otan and @sumeerbhola)


pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):

	// index that has a true value. PreFilter must only be called when CanPreFilter()
	// is true. This batching of pre-filter application for a looked up row allows
	// enc to be decoded once.

This is all a bit abstract -- could you make it more concrete with an example?


pkg/sql/opt/invertedidx/geo.go, line 713 at r2 (raw file):

		}
		srid = g.SRID()
	}

probably still worth adding a default case with a panic just to be safe...


pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):

// PreFilter pre-filters a retrieved inverted values against a set
// of pre-filter states.
func (p *PreFilterer) PreFilter(

What does the state returned in result mean v. the single boolean return value?


pkg/sql/opt/invertedidx/geo.go, line 745 at r2 (raw file):

		}
		rv := false
		for i := range preFilters {

why would we have multiple preFilters?


pkg/sql/opt/invertedidx/geo_test.go, line 525 at r2 (raw file):

			shapes:       geoShapes,
			expected: [][]bool{
				{true, false, false, true}, {false, true, false, true}, {false, false, true, false},

[nit] I think this would be easier to read if the four rows were stacked on top of each other like a matrix

Also, would be good to add a comment somewhere explaining what this is testing. It seems like you're comparing each of the four shapes above with each other, but making that explicit would help.

Are there any tests here where the preFilter returns true but the final result would be false? Might help to note that with a comment.


pkg/sql/rowexec/inverted_expr_evaluator.go, line 373 at r2 (raw file):

				append(nextSpan.exprAndSetIndexList, pendingSpans[i].exprAndSetIndexList...)
		}
		sort.Sort(exprAndSetIndexSorter(nextSpan.exprAndSetIndexList))

[nit] add a comment to explain what this block of code is doing


pkg/sql/rowexec/inverted_expr_evaluator.go, line 482 at r2 (raw file):

// TODO(sumeer): if this will be called in non-decreasing order of enc,
// use that to optimize the binary search.
func (b *batchedInvertedExprEvaluator) prepareAddIndexRow(

[nit] add a comment to explain what is happening here in prepareAddIndexRow v. addIndexRow. Also explain what the return value means.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!
I've addressed the comments.
I need to get to the bottom of the failure in TestStats/inverted-geo caused by a plan change -- it has chosen to use the invertedFilterer. https://teamcity.cockroachdb.com/viewLog.html?buildId=2249375&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)


pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This is all a bit abstract -- could you make it more concrete with an example?

Done.


pkg/sql/opt/invertedidx/geo.go, line 713 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

probably still worth adding a default case with a panic just to be safe...

Done.


pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What does the state returned in result mean v. the single boolean return value?

Added a comment. It represents whether there is at least one result with a true value.


pkg/sql/opt/invertedidx/geo.go, line 745 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why would we have multiple preFilters?

Each of the left-side in a batch has a pre-filter extracted for it (the bounding box and SRID).
There is a comment where DatumsToInvertedExpr is declared that motivates this: "This batching of pre-filter application for a looked up row allows enc to be decoded once."


pkg/sql/opt/invertedidx/geo_test.go, line 525 at r2 (raw file):

[nit] I think this would be easier to read if the four rows were stacked on top of each other like a matrix

Done

Also, would be good to add a comment somewhere explaining what this is testing. It seems like you're comparing each of the four shapes above with each other, but making that explicit would help.

Added a comment

Are there any tests here where the preFilter returns true but the final result would be false? Might help to note that with a comment.

I've added a test where all the results are false, so the return value would be false.


pkg/sql/rowexec/inverted_expr_evaluator.go, line 373 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] add a comment to explain what this block of code is doing

Done


pkg/sql/rowexec/inverted_expr_evaluator.go, line 482 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] add a comment to explain what is happening here in prepareAddIndexRow v. addIndexRow. Also explain what the return value means.

Done. Also removed the unused parameter from addIndexRow

@sumeerbhola sumeerbhola force-pushed the inverted_bbox branch 2 times, most recently from 58cd34f to 46635a3 Compare September 8, 2020 17:09
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failure in TestStats/inverted-geo caused by a plan change -- it has chosen to use the invertedFilterer

it is fixed now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r3, 5 of 5 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @sumeerbhola)


pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):

Previously, sumeerbhola wrote…

Done.

It might also help to give some hint about what the interface{} objects contain.


pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):

Previously, sumeerbhola wrote…

Added a comment. It represents whether there is at least one result with a true value.

Ah, didn't realize this was the interface implementation. I think it's more common to have a comment like // PreFilter is part of the DatumsToInvertedExpr interface., and put all the relevant info on the interface method.


pkg/sql/opt/invertedidx/geo.go, line 714 at r3 (raw file):

		srid = g.SRID()
	default:
		panic(errors.Errorf("datum of unhandled type: ", d))

I think all these panic errors should be AssertionFailedf

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)


pkg/sql/opt/invertedexpr/expression.go, line 52 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It might also help to give some hint about what the interface{} objects contain.

Done


pkg/sql/opt/invertedidx/geo.go, line 725 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Ah, didn't realize this was the interface implementation. I think it's more common to have a comment like // PreFilter is part of the DatumsToInvertedExpr interface., and put all the relevant info on the interface method.

The same thing is stated in DatumsToInvertedExpr. I've updated the comment to state that PreFilterer does not implement the full interface (which is why the comment was not saying "PerFilter is part of the ..." -- I thought it may mislead the reader).


pkg/sql/opt/invertedidx/geo.go, line 714 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think all these panic errors should be AssertionFailedf

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @sumeerbhola)


pkg/sql/opt/invertedidx/geo.go, line 727 at r5 (raw file):

// PreFilter pre-filters a retrieved inverted value against a set of
// pre-filter states. The function signature matches the PreFilter function of
// the DatumsToInvertedExpr interface (PreFilterer does not implment the full

implment -> implement

…ed column

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,
- st_intersects: input 3110, output 490
- st_dwithin: input 23183, output 8606

With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).

This is a format change for the index, so is easier to do
before the launch of spatial features.

The encoding here is trivial, and consumes 17 bytes for points
and 33 bytes for others. An experiment with a more complicated
encoding that converted to integers while bounding the
relative error did not show any improvement wrt the following
benchmarks.

There are TODOs for followup PRs:
- Do such pre-filtering for invertedFilterer. This will need
  a spec change.
- Support pre-filtering for expressions containing more than
  one function.

Comparison numbers:

old is master
name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: This is an incompatible format change,
and discussions with engineering and product indicated a
preference to do this now before we launch spatial features

Release note: None
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @otan and @rytaft)


pkg/sql/opt/invertedidx/geo.go, line 727 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

implment -> implement

Done.

@sumeerbhola
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 8, 2020

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 8, 2020

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 8, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants