-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
geoindex,invertedidx,rowexec,...: add bbox to inverted column #53776
Conversation
The bounding box is encoded after the cellid, and used to perform "pre-filtering", i.e., filtering after the inverted row is fetched but before it is added to various set expressions. The fundamental weakness with the cell covering approach is that the coverings for convex shapes are typically worse than a bounding box and we pay significant cost in doing a lookup join to eliminate the false positives. This attempts to reduce those false positives for the lookup join, and part of the inverted join (the invertedJoiner can avoid adding a row to the RowContainer). The code is incomplete in that it lacks new tests and doesn't do this pre-filtering for Geography and the invertedFilterer. There is also a question of whether this information should be part of the inverted column or placed in an additional column (latter would require generalizing the structure of inverted indexes). For the canonical nyc st_intersects and st_dwithin queries the joinReader would see a much higher input row count than the output due to false positives. Specifically, - st_intersects: input 3110, output 490 - st_dwithin: input 23183, output 8606 With this change the input counts are 1064 and 11177. There is the additional cost of reading extra bytes and applying the bounding box filter, but it is more than offset by the gains (see comparison numbers below). This is a format change for the index, so we will need to support both formats. Also, the encoding here is trivial, and consumes 32 bytes. We can do better wrt bytes but will need to measure how that affects decoding cost. Fuller comparison numbers: old is master name old ms new ms delta Test/nyc/nyc-intersects 80.0 ±23% 69.0 ±30% -13.76% (p=0.000 n=49+48) Test/nyc/nyc-dwithin 225 ±13% 169 ± 8% -24.88% (p=0.000 n=46+45) Test/nyc/nyc-coveredby 84.1 ±11% 69.0 ±14% -17.96% (p=0.000 n=50+49) Test/nyc/nyc-expand-blocks 85.7 ± 6% 84.6 ± 8% -1.22% (p=0.030 n=45+47) Test/nyc/nyc-dwithin-and-dfullywithin 195 ± 7% 195 ± 6% ~ (p=1.000 n=46+49) Test/postgis_geometry_tutorial/11.2b 1.66 ±24% 1.53 ±45% -7.87% (p=0.006 n=50+50) Test/postgis_geometry_tutorial/11.6 1.61 ±19% 1.66 ±32% ~ (p=0.329 n=46+50) Test/postgis_geometry_tutorial/12.1.2 0.89 ±26% 0.91 ±26% ~ (p=0.532 n=46+50) Test/postgis_geometry_tutorial/12.2.3 1.33 ±20% 1.38 ±36% ~ (p=0.308 n=49+49) Test/postgis_geometry_tutorial/12.2.4 1.30 ±27% 1.29 ±18% ~ (p=0.746 n=49+45) Test/postgis_geometry_tutorial/13.0 1.49 ±36% 1.55 ±33% ~ (p=0.141 n=49+50) Test/postgis_geometry_tutorial/13.1a 233 ± 7% 197 ± 7% -15.45% (p=0.000 n=48+47) Test/postgis_geometry_tutorial/13.1c 24.9 ±15% 19.4 ±18% -22.17% (p=0.000 n=50+48) Test/postgis_geometry_tutorial/13.2 309 ± 4% 257 ± 7% -16.92% (p=0.000 n=43+47) Test/postgis_geometry_tutorial/14.1a 1.51 ±15% 1.70 ±16% +12.93% (p=0.000 n=48+44) Test/postgis_geometry_tutorial/14.2b 6.28 ±22% 6.07 ±19% ~ (p=0.050 n=50+45) Test/postgis_geometry_tutorial/14.2c 3.70 ±14% 3.47 ±10% -5.98% (p=0.000 n=48+46) Test/postgis_geometry_tutorial/14.3c 31.9 ±11% 25.9 ±12% -18.60% (p=0.000 n=48+49) Test/postgis_geometry_tutorial/15.0 1.50 ±20% 1.50 ±20% ~ (p=0.860 n=48+49) Release justification: probably not for this release Release note: None
hmm, if it's an encoding change i'm not sure we can do it if it's not this release... |
I didn't look in detail at everything, but it seems like a reasonable approach and the performance improvement is impressive! Does seem worth pushing this further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are folks comfortable with it going into this release -- if so I'll start cleaning this up?
I have two followup improvements for DWithin (not in this PR)
-
an improved DWithin bbox comparison (I'll send it as a separate PR, that can precede this), which decreases the joinReader input rows from 11177 to 10001 with the following latency improvement
Test/nyc/nyc-dwithin 169 ± 8% 157 ± 6% -7.18% (p=0.000 n=45+45)
-
using the bbox buffer for constructing the index lookup, to cut down on the geos code. I had tried it earlier and it had made latency worse because it increased the number of false positives. Now with the invertedJoiner doing the bbox comparison, the false positives stay low
- joinReader rows increased from 10001 to 10007
- invertedJoiner index rows read increased from 44177 to 50499
- overall latency decreases
Test/nyc/nyc-dwithin 157 ± 6% 134 ± 6% -14.48% (p=0.000 n=45+48)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's late for a fundamental change to indexing to go in, but I'd argue that since this is 1) new functionality, 2) gives a big win, and 3) is really hard to add later, that it should go into this release.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)
let's do it.
…On Wed, Sep 2, 2020 at 1:13 PM Andy Kimball ***@***.***> wrote:
***@***.**** commented on this pull request.
It's late for a fundamental change to indexing to go in, but I'd argue
that since this is 1) new functionality, 2) gives a big win, and 3) is
really hard to add later, that it should go into this release.
*Reviewable
<https://reviewable.io/reviews/cockroachdb/cockroach/53776#-:-MGFRiXj8e12-Zgf6yYw:b-sz1bsu>*
status: [image: ] complete! 0 of 0 LGTMs obtained (waiting on
@otan <https://github.com/otan> and @rytaft <https://github.com/rytaft>)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#53776 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA32FQ6HWP67EOJ7LOFIVTTSD2RP3ANCNFSM4QSAUROQ>
.
--
Oliver Tan | Member of Technical Staff
Cockroach Labs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rytaft)
pkg/util/encoding/encoding.go, line 840 at r1 (raw file):
} // TODO(sumeer): optimize encoding to use fewer bytes.
FTR, I tried a trivial version of a more complicated scheme (that needs to bound the error introduced in conversion to integers), that simply rounds down/up the lo/hi coordinates, and encodes the lo coordinates as ints and hi coordinates as int deltas wrt the lo values.
This cuts down the length from 32 bytes to ~12 bytes for the nyc_census_blocks
table. And the false positives do increase, as expected
- DWithin query: joinReader rows increased from 10007 to 10047
- Intersects query: joinReader rows increased from 1064 to 1227
But there was no latency improvement in the queries due to the byte reduction
Test/nyc/nyc-intersects 69.0 ±26% 73.1 ±24% +5.96% (p=0.000 n=50+50)
Test/nyc/nyc-dwithin 141 ± 8% 146 ± 9% +3.54% (p=0.000 n=45+47)
Test/nyc/nyc-coveredby 68.6 ± 6% 72.5 ±11% +5.59% (p=0.000 n=49+48)
So I am planning to do this simple 32 byte encoding.
The cleaned up version is #53963 |
The bounding box is encoded after the cellid, and used to perform
"pre-filtering", i.e., filtering after the inverted row is fetched
but before it is added to various set expressions.
The fundamental weakness with the cell covering
approach is that the coverings for convex shapes are typically
worse than a bounding box and we pay significant cost in doing
a lookup join to eliminate the false positives. This attempts
to reduce those false positives for the lookup join, and part
of the inverted join (the invertedJoiner can avoid adding a
row to the RowContainer).
The code is incomplete in that it lacks new tests and doesn't
do this pre-filtering for Geography and the invertedFilterer.
There is also a question of whether this information should be
part of the inverted column or placed in an additional column
(latter would require generalizing the structure of inverted
indexes).
For the canonical nyc st_intersects and st_dwithin queries
the joinReader would see a much higher input row count than
the output due to false positives. Specifically,
With this change the input counts are 1064 and 11177. There
is the additional cost of reading extra bytes and applying
the bounding box filter, but it is more than offset by the
gains (see comparison numbers below).
This is a format change for the index, so we will need to
support both formats. Also, the encoding here is
trivial, and consumes 32 bytes. We can do better wrt bytes
but will need to measure how that affects decoding cost.
Fuller comparison numbers (old is master):
Release justification: probably not for this release
Release note: None