Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776

Closed
wants to merge 1 commit into from

Conversation

sumeerbhola
Copy link
Collaborator

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

The code is incomplete in that it lacks new tests and doesn't
do this pre-filtering for Geography and the invertedFilterer.
There is also a question of whether this information should be
part of the inverted column or placed in an additional column
(latter would require generalizing the structure of inverted
indexes).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,

  • st_intersects: input 3110, output 490
  • st_dwithin: input 23183, output 8606

With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).

This is a format change for the index, so we will need to
support both formats. Also, the encoding here is
trivial, and consumes 32 bytes. We can do better wrt bytes
but will need to measure how that affects decoding cost.

Fuller comparison numbers (old is master):

name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: probably not for this release
Release note: None

The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.

The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).

The code is incomplete in that it lacks new tests and doesn't
do this pre-filtering for Geography and the invertedFilterer.
There is also a question of whether this information should be
part of the inverted column or placed in an additional column
(latter would require generalizing the structure of inverted
indexes).

For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,
- st_intersects: input 3110, output 490
- st_dwithin: input 23183, output 8606
With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).

This is a format change for the index, so we will need to
support both formats. Also, the encoding here is
trivial, and consumes 32 bytes. We can do better wrt bytes
but will need to measure how that affects decoding cost.

Fuller comparison numbers:

old is master
name                                   old ms     new ms     delta
Test/nyc/nyc-intersects                80.0 ±23%  69.0 ±30%  -13.76%  (p=0.000 n=49+48)
Test/nyc/nyc-dwithin                    225 ±13%   169 ± 8%  -24.88%  (p=0.000 n=46+45)
Test/nyc/nyc-coveredby                 84.1 ±11%  69.0 ±14%  -17.96%  (p=0.000 n=50+49)
Test/nyc/nyc-expand-blocks             85.7 ± 6%  84.6 ± 8%   -1.22%  (p=0.030 n=45+47)
Test/nyc/nyc-dwithin-and-dfullywithin   195 ± 7%   195 ± 6%     ~     (p=1.000 n=46+49)
Test/postgis_geometry_tutorial/11.2b   1.66 ±24%  1.53 ±45%   -7.87%  (p=0.006 n=50+50)
Test/postgis_geometry_tutorial/11.6    1.61 ±19%  1.66 ±32%     ~     (p=0.329 n=46+50)
Test/postgis_geometry_tutorial/12.1.2  0.89 ±26%  0.91 ±26%     ~     (p=0.532 n=46+50)
Test/postgis_geometry_tutorial/12.2.3  1.33 ±20%  1.38 ±36%     ~     (p=0.308 n=49+49)
Test/postgis_geometry_tutorial/12.2.4  1.30 ±27%  1.29 ±18%     ~     (p=0.746 n=49+45)
Test/postgis_geometry_tutorial/13.0    1.49 ±36%  1.55 ±33%     ~     (p=0.141 n=49+50)
Test/postgis_geometry_tutorial/13.1a    233 ± 7%   197 ± 7%  -15.45%  (p=0.000 n=48+47)
Test/postgis_geometry_tutorial/13.1c   24.9 ±15%  19.4 ±18%  -22.17%  (p=0.000 n=50+48)
Test/postgis_geometry_tutorial/13.2     309 ± 4%   257 ± 7%  -16.92%  (p=0.000 n=43+47)
Test/postgis_geometry_tutorial/14.1a   1.51 ±15%  1.70 ±16%  +12.93%  (p=0.000 n=48+44)
Test/postgis_geometry_tutorial/14.2b   6.28 ±22%  6.07 ±19%     ~     (p=0.050 n=50+45)
Test/postgis_geometry_tutorial/14.2c   3.70 ±14%  3.47 ±10%   -5.98%  (p=0.000 n=48+46)
Test/postgis_geometry_tutorial/14.3c   31.9 ±11%  25.9 ±12%  -18.60%  (p=0.000 n=48+49)
Test/postgis_geometry_tutorial/15.0    1.50 ±20%  1.50 ±20%     ~     (p=0.860 n=48+49)

Release justification: probably not for this release
Release note: None
@sumeerbhola sumeerbhola requested review from rytaft, otan and a team September 1, 2020 18:13
@sumeerbhola sumeerbhola requested a review from a team as a code owner September 1, 2020 18:13
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@otan
Copy link
Contributor

otan commented Sep 1, 2020

hmm, if it's an encoding change i'm not sure we can do it if it's not this release...
edit: oh i see we're planning to make it inverted compatible

@rytaft
Copy link
Collaborator

rytaft commented Sep 1, 2020

I didn't look in detail at everything, but it seems like a reasonable approach and the performance improvement is impressive! Does seem worth pushing this further.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are folks comfortable with it going into this release -- if so I'll start cleaning this up?

I have two followup improvements for DWithin (not in this PR)

  • an improved DWithin bbox comparison (I'll send it as a separate PR, that can precede this), which decreases the joinReader input rows from 11177 to 10001 with the following latency improvement
    Test/nyc/nyc-dwithin 169 ± 8% 157 ± 6% -7.18% (p=0.000 n=45+45)

  • using the bbox buffer for constructing the index lookup, to cut down on the geos code. I had tried it earlier and it had made latency worse because it increased the number of false positives. Now with the invertedJoiner doing the bbox comparison, the false positives stay low

    • joinReader rows increased from 10001 to 10007
    • invertedJoiner index rows read increased from 44177 to 50499
    • overall latency decreases
      Test/nyc/nyc-dwithin 157 ± 6% 134 ± 6% -14.48% (p=0.000 n=45+48)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's late for a fundamental change to indexing to go in, but I'd argue that since this is 1) new functionality, 2) gives a big win, and 3) is really hard to add later, that it should go into this release.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)

@otan
Copy link
Contributor

otan commented Sep 2, 2020 via email

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)


pkg/util/encoding/encoding.go, line 840 at r1 (raw file):

}

// TODO(sumeer): optimize encoding to use fewer bytes.

FTR, I tried a trivial version of a more complicated scheme (that needs to bound the error introduced in conversion to integers), that simply rounds down/up the lo/hi coordinates, and encodes the lo coordinates as ints and hi coordinates as int deltas wrt the lo values.
This cuts down the length from 32 bytes to ~12 bytes for the nyc_census_blocks table. And the false positives do increase, as expected

  • DWithin query: joinReader rows increased from 10007 to 10047
  • Intersects query: joinReader rows increased from 1064 to 1227

But there was no latency improvement in the queries due to the byte reduction

Test/nyc/nyc-intersects                69.0 ±26%  73.1 ±24%  +5.96%  (p=0.000 n=50+50)
Test/nyc/nyc-dwithin                    141 ± 8%   146 ± 9%  +3.54%  (p=0.000 n=45+47)
Test/nyc/nyc-coveredby                 68.6 ± 6%  72.5 ±11%  +5.59%  (p=0.000 n=49+48)

So I am planning to do this simple 32 byte encoding.

@sumeerbhola
Copy link
Collaborator Author

The cleaned up version is #53963
I am closing this.

@sumeerbhola sumeerbhola closed this Sep 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants