geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

sumeerbhola · 2020-09-01T18:13:20Z

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

The code is incomplete in that it lacks new tests and doesn't
do this pre-filtering for Geography and the invertedFilterer.
There is also a question of whether this information should be
part of the inverted column or placed in an additional column
(latter would require generalizing the structure of inverted
indexes).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,

st_intersects: input 3110, output 490
st_dwithin: input 23183, output 8606

With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).

This is a format change for the index, so we will need to
support both formats. Also, the encoding here is
trivial, and consumes 32 bytes. We can do better wrt bytes
but will need to measure how that affects decoding cost.

Fuller comparison numbers (old is master):

name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: probably not for this release
Release note: None

The bounding box is encoded after the cellid, and used to perform "pre-filtering", i.e., filtering after the inverted row is fetched but before it is added to various set expressions. The fundamental weakness with the cell covering approach is that the coverings for convex shapes are typically worse than a bounding box and we pay significant cost in doing a lookup join to eliminate the false positives. This attempts to reduce those false positives for the lookup join, and part of the inverted join (the invertedJoiner can avoid adding a row to the RowContainer). The code is incomplete in that it lacks new tests and doesn't do this pre-filtering for Geography and the invertedFilterer. There is also a question of whether this information should be part of the inverted column or placed in an additional column (latter would require generalizing the structure of inverted indexes). For the canonical nyc st_intersects and st_dwithin queries the joinReader would see a much higher input row count than the output due to false positives. Specifically, - st_intersects: input 3110, output 490 - st_dwithin: input 23183, output 8606 With this change the input counts are 1064 and 11177. There is the additional cost of reading extra bytes and applying the bounding box filter, but it is more than offset by the gains (see comparison numbers below). This is a format change for the index, so we will need to support both formats. Also, the encoding here is trivial, and consumes 32 bytes. We can do better wrt bytes but will need to measure how that affects decoding cost. Fuller comparison numbers: old is master name old ms new ms delta Test/nyc/nyc-intersects 80.0 ±23% 69.0 ±30% -13.76% (p=0.000 n=49+48) Test/nyc/nyc-dwithin 225 ±13% 169 ± 8% -24.88% (p=0.000 n=46+45) Test/nyc/nyc-coveredby 84.1 ±11% 69.0 ±14% -17.96% (p=0.000 n=50+49) Test/nyc/nyc-expand-blocks 85.7 ± 6% 84.6 ± 8% -1.22% (p=0.030 n=45+47) Test/nyc/nyc-dwithin-and-dfullywithin 195 ± 7% 195 ± 6% ~ (p=1.000 n=46+49) Test/postgis_geometry_tutorial/11.2b 1.66 ±24% 1.53 ±45% -7.87% (p=0.006 n=50+50) Test/postgis_geometry_tutorial/11.6 1.61 ±19% 1.66 ±32% ~ (p=0.329 n=46+50) Test/postgis_geometry_tutorial/12.1.2 0.89 ±26% 0.91 ±26% ~ (p=0.532 n=46+50) Test/postgis_geometry_tutorial/12.2.3 1.33 ±20% 1.38 ±36% ~ (p=0.308 n=49+49) Test/postgis_geometry_tutorial/12.2.4 1.30 ±27% 1.29 ±18% ~ (p=0.746 n=49+45) Test/postgis_geometry_tutorial/13.0 1.49 ±36% 1.55 ±33% ~ (p=0.141 n=49+50) Test/postgis_geometry_tutorial/13.1a 233 ± 7% 197 ± 7% -15.45% (p=0.000 n=48+47) Test/postgis_geometry_tutorial/13.1c 24.9 ±15% 19.4 ±18% -22.17% (p=0.000 n=50+48) Test/postgis_geometry_tutorial/13.2 309 ± 4% 257 ± 7% -16.92% (p=0.000 n=43+47) Test/postgis_geometry_tutorial/14.1a 1.51 ±15% 1.70 ±16% +12.93% (p=0.000 n=48+44) Test/postgis_geometry_tutorial/14.2b 6.28 ±22% 6.07 ±19% ~ (p=0.050 n=50+45) Test/postgis_geometry_tutorial/14.2c 3.70 ±14% 3.47 ±10% -5.98% (p=0.000 n=48+46) Test/postgis_geometry_tutorial/14.3c 31.9 ±11% 25.9 ±12% -18.60% (p=0.000 n=48+49) Test/postgis_geometry_tutorial/15.0 1.50 ±20% 1.50 ±20% ~ (p=0.860 n=48+49) Release justification: probably not for this release Release note: None

cockroach-teamcity · 2020-09-01T18:13:28Z

This change is

otan · 2020-09-01T19:43:50Z

hmm, if it's an encoding change i'm not sure we can do it if it's not this release...
edit: oh i see we're planning to make it inverted compatible

rytaft · 2020-09-01T21:57:29Z

I didn't look in detail at everything, but it seems like a reasonable approach and the performance improvement is impressive! Does seem worth pushing this further.

sumeerbhola

Are folks comfortable with it going into this release -- if so I'll start cleaning this up?

I have two followup improvements for DWithin (not in this PR)

an improved DWithin bbox comparison (I'll send it as a separate PR, that can precede this), which decreases the joinReader input rows from 11177 to 10001 with the following latency improvement
Test/nyc/nyc-dwithin 169 ± 8% 157 ± 6% -7.18% (p=0.000 n=45+45)
using the bbox buffer for constructing the index lookup, to cut down on the geos code. I had tried it earlier and it had made latency worse because it increased the number of false positives. Now with the invertedJoiner doing the bbox comparison, the false positives stay low
- joinReader rows increased from 10001 to 10007
- invertedJoiner index rows read increased from 44177 to 50499
- overall latency decreases
  Test/nyc/nyc-dwithin 157 ± 6% 134 ± 6% -14.48% (p=0.000 n=45+48)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)

andy-kimball

It's late for a fundamental change to indexing to go in, but I'd argue that since this is 1) new functionality, 2) gives a big win, and 3) is really hard to add later, that it should go into this release.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)

otan · 2020-09-02T20:14:46Z

let's do it.

…

On Wed, Sep 2, 2020 at 1:13 PM Andy Kimball ***@***.***> wrote: ***@***.**** commented on this pull request. It's late for a fundamental change to indexing to go in, but I'd argue that since this is 1) new functionality, 2) gives a big win, and 3) is really hard to add later, that it should go into this release. *Reviewable <https://reviewable.io/reviews/cockroachdb/cockroach/53776#-:-MGFRiXj8e12-Zgf6yYw:b-sz1bsu>* status: [image:

] complete! 0 of 0 LGTMs obtained (waiting on @otan <https://github.com/otan> and @rytaft <https://github.com/rytaft>) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#53776 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA32FQ6HWP67EOJ7LOFIVTTSD2RP3ANCNFSM4QSAUROQ> .

-- Oliver Tan | Member of Technical Staff Cockroach Labs

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)

pkg/util/encoding/encoding.go, line 840 at r1 (raw file):

}

// TODO(sumeer): optimize encoding to use fewer bytes.

FTR, I tried a trivial version of a more complicated scheme (that needs to bound the error introduced in conversion to integers), that simply rounds down/up the lo/hi coordinates, and encodes the lo coordinates as ints and hi coordinates as int deltas wrt the lo values.
This cuts down the length from 32 bytes to ~12 bytes for the nyc_census_blocks table. And the false positives do increase, as expected

DWithin query: joinReader rows increased from 10007 to 10047
Intersects query: joinReader rows increased from 1064 to 1227

But there was no latency improvement in the queries due to the byte reduction

Test/nyc/nyc-intersects                69.0 ±26%  73.1 ±24%  +5.96%  (p=0.000 n=50+50)
Test/nyc/nyc-dwithin                    141 ± 8%   146 ± 9%  +3.54%  (p=0.000 n=45+47)
Test/nyc/nyc-coveredby                 68.6 ± 6%  72.5 ±11%  +5.59%  (p=0.000 n=49+48)

So I am planning to do this simple 32 byte encoding.

sumeerbhola · 2020-09-04T16:13:06Z

The cleaned up version is #53963
I am closing this.

sumeerbhola requested review from rytaft, otan and a team September 1, 2020 18:13

sumeerbhola requested a review from a team as a code owner September 1, 2020 18:13

sumeerbhola commented Sep 2, 2020

View reviewed changes

andy-kimball reviewed Sep 2, 2020

View reviewed changes

sumeerbhola commented Sep 2, 2020

View reviewed changes

sumeerbhola closed this Sep 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

sumeerbhola commented Sep 1, 2020

cockroach-teamcity commented Sep 1, 2020

otan commented Sep 1, 2020 •

edited

Loading

rytaft commented Sep 1, 2020

sumeerbhola left a comment

andy-kimball left a comment

otan commented Sep 2, 2020 via email

sumeerbhola left a comment

sumeerbhola commented Sep 4, 2020

geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

Conversation

sumeerbhola commented Sep 1, 2020

cockroach-teamcity commented Sep 1, 2020

otan commented Sep 1, 2020 • edited Loading

rytaft commented Sep 1, 2020

sumeerbhola left a comment

Choose a reason for hiding this comment

andy-kimball left a comment

Choose a reason for hiding this comment

otan commented Sep 2, 2020 via email

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola commented Sep 4, 2020

otan commented Sep 1, 2020 •

edited

Loading