release-20.2: geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #54128

sumeerbhola · 2020-09-09T15:40:40Z

Backport 1/1 commits from #53963.

/cc @cockroachdb/release

…ed column

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,

st_intersects: input 3110, output 490
st_dwithin: input 23183, output 8606
With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).

This is a format change for the index, so is easier to do
before the launch of spatial features.

The encoding here is trivial, and consumes 17 bytes for points
and 33 bytes for others. An experiment with a more complicated
encoding that converted to integers while bounding the
relative error did not show any improvement wrt the following
benchmarks.

There are TODOs for followup PRs:

Do such pre-filtering for invertedFilterer. This will need
a spec change.
Support pre-filtering for expressions containing more than
one function.

Comparison numbers:

old is master

name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: This is an incompatible format change,
and discussions with engineering and product indicated a
preference to do this now before we launch spatial features

Release note: None

…ed column The bounding box is encoded after the cellid, and used to perform "pre-filtering", i.e., filtering after the inverted row is fetched but before it is added to various set expressions. The fundamental weakness with the cell covering approach is that the coverings for convex shapes are typically worse than a bounding box and we pay significant cost in doing a lookup join to eliminate the false positives. This attempts to reduce those false positives for the lookup join, and part of the inverted join (the invertedJoiner can avoid adding a row to the RowContainer). For the canonical nyc st_intersects and st_dwithin queries the joinReader would see a much higher input row count than the output due to false positives. Specifically, - st_intersects: input 3110, output 490 - st_dwithin: input 23183, output 8606 With this change the input counts are 1064 and 11177. There is the additional cost of reading extra bytes and applying the bounding box filter, but it is more than offset by the gains (see comparison numbers below). This is a format change for the index, so is easier to do before the launch of spatial features. The encoding here is trivial, and consumes 17 bytes for points and 33 bytes for others. An experiment with a more complicated encoding that converted to integers while bounding the relative error did not show any improvement wrt the following benchmarks. There are TODOs for followup PRs: - Do such pre-filtering for invertedFilterer. This will need a spec change. - Support pre-filtering for expressions containing more than one function. Comparison numbers: old is master name old ms new ms delta Test/nyc/nyc-intersects 80.0 ±23% 69.0 ±30% -13.76% (p=0.000 n=49+48) Test/nyc/nyc-dwithin 225 ±13% 169 ± 8% -24.88% (p=0.000 n=46+45) Test/nyc/nyc-coveredby 84.1 ±11% 69.0 ±14% -17.96% (p=0.000 n=50+49) Test/nyc/nyc-expand-blocks 85.7 ± 6% 84.6 ± 8% -1.22% (p=0.030 n=45+47) Test/nyc/nyc-dwithin-and-dfullywithin 195 ± 7% 195 ± 6% ~ (p=1.000 n=46+49) Test/postgis_geometry_tutorial/11.2b 1.66 ±24% 1.53 ±45% -7.87% (p=0.006 n=50+50) Test/postgis_geometry_tutorial/11.6 1.61 ±19% 1.66 ±32% ~ (p=0.329 n=46+50) Test/postgis_geometry_tutorial/12.1.2 0.89 ±26% 0.91 ±26% ~ (p=0.532 n=46+50) Test/postgis_geometry_tutorial/12.2.3 1.33 ±20% 1.38 ±36% ~ (p=0.308 n=49+49) Test/postgis_geometry_tutorial/12.2.4 1.30 ±27% 1.29 ±18% ~ (p=0.746 n=49+45) Test/postgis_geometry_tutorial/13.0 1.49 ±36% 1.55 ±33% ~ (p=0.141 n=49+50) Test/postgis_geometry_tutorial/13.1a 233 ± 7% 197 ± 7% -15.45% (p=0.000 n=48+47) Test/postgis_geometry_tutorial/13.1c 24.9 ±15% 19.4 ±18% -22.17% (p=0.000 n=50+48) Test/postgis_geometry_tutorial/13.2 309 ± 4% 257 ± 7% -16.92% (p=0.000 n=43+47) Test/postgis_geometry_tutorial/14.1a 1.51 ±15% 1.70 ±16% +12.93% (p=0.000 n=48+44) Test/postgis_geometry_tutorial/14.2b 6.28 ±22% 6.07 ±19% ~ (p=0.050 n=50+45) Test/postgis_geometry_tutorial/14.2c 3.70 ±14% 3.47 ±10% -5.98% (p=0.000 n=48+46) Test/postgis_geometry_tutorial/14.3c 31.9 ±11% 25.9 ±12% -18.60% (p=0.000 n=48+49) Test/postgis_geometry_tutorial/15.0 1.50 ±20% 1.50 ±20% ~ (p=0.860 n=48+49) Release justification: This is an incompatible format change, and discussions with engineering and product indicated a preference to do this now before we launch spatial features Release note: None

cockroach-teamcity · 2020-09-09T15:40:48Z

This change is

rytaft

Should you add this here: #53662?

Reviewed 34 of 34 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @otan)

sumeerbhola · 2020-09-09T16:43:43Z

Added to #53662

sumeerbhola · 2020-09-14T22:08:26Z

I'm assuming the check mark in #53662 means I can merge this.

sumeerbhola requested review from rytaft and otan September 9, 2020 15:40

sumeerbhola requested a review from a team as a code owner September 9, 2020 15:40

rytaft approved these changes Sep 9, 2020

View reviewed changes

isaactwong mentioned this pull request Sep 9, 2020

20.2 Release Backports Request List #53662

Closed

40 tasks

sumeerbhola merged commit 7d5e070 into cockroachdb:release-20.2 Sep 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-20.2: geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #54128

release-20.2: geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #54128

sumeerbhola commented Sep 9, 2020

cockroach-teamcity commented Sep 9, 2020

rytaft left a comment

sumeerbhola commented Sep 9, 2020

sumeerbhola commented Sep 14, 2020

release-20.2: geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #54128

release-20.2: geoindex,invertedidx,rowexec,encoding,...: add bounding box to invert… #54128

Conversation

sumeerbhola commented Sep 9, 2020

cockroach-teamcity commented Sep 9, 2020

rytaft left a comment

Choose a reason for hiding this comment

sumeerbhola commented Sep 9, 2020

sumeerbhola commented Sep 14, 2020