-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
geo: order geospatial objects by Hilbert Curve index #51898
Conversation
5a70cdf
to
c751ffa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's now CompareSpatialObject and SpaceCurveIndex. It is then up to callers in other packages that they need to call both in a specific order. Should we instead make a single function in this package that does the correct thing? If not, we should document somewhere the relationship between those two functions and how they should be used when comparing objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to remove the geo exceptions in sqlbase/encoded_datum_test.go so that TestEncDatumCompare tests these now.
Also someone else should review the implementation of the new geo stuff here. I'm not qualified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)
pkg/geo/geo.go, line 281 at r4 (raw file):
} // SpaceCurveIndex returns an uint64 index to use representing an index into a Space Curve.
nit: ... an index into a Space-filling Curve.
pkg/geo/geo.go, line 291 at r4 (raw file):
centerX := (bbox.BoundingBox.LoX + bbox.BoundingBox.HiX) / 2 centerY := (bbox.BoundingBox.LoY + bbox.BoundingBox.HiY) / 2 // By default, bound by MaxFloat32.
The comment says MaxFloat32
but the code is doing MaxInt32
.
pkg/geo/geo.go, line 307 at r4 (raw file):
// Find the longest dimension, add one to be inclusive. // TODO(otan): investigate storing this calculation on the projection itself for perf.
How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)
pkg/geo/geo.go, line 309 at r4 (raw file):
// TODO(otan): investigate storing this calculation on the projection itself for perf. biggestBoxLength := math.Max(bounds.MaxX-bounds.MinX, bounds.MaxY-bounds.MinY) + 1 boxSize := 1 << 40
what is the reason for this value 1 << 40
versus some other power of 2? Won't this overflow when doing (2^40)^2-1
?
It is not clear why we are trying to compute a variable box size since we normalize to [0, 1] below before multiplying by the box size. Given the hilbert library is using signed int
, it seems simpler to use a constant box size of 2^15
, so that we get a maximum return value of 2^30-1
(I didn't look at the implementation of the library to see what it does wrt overflow) which gives us the most ability to distinguish between different centers.
pkg/geo/geo.go, line 309 at r4 (raw file): Previously, sumeerbhola wrote…
A signed int in Golang is int64 on 64-bit systems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)
pkg/geo/geo.go, line 540 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
More comments, as above.
Done.
pkg/geo/geo.go, line 291 at r4 (raw file):
Previously, sumeerbhola wrote…
The comment says
MaxFloat32
but the code is doingMaxInt32
.
Done.
pkg/geo/geo.go, line 307 at r4 (raw file):
Previously, sumeerbhola wrote…
How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)
Done.
pkg/util/encoding/encoding.go, line 1029 at r2 (raw file):
Previously, mjibson (Matt Jibson) wrote…
add a comment here for what the boolean argument is
Done.
pkg/util/encoding/encoding.go, line 1052 at r2 (raw file):
Previously, mjibson (Matt Jibson) wrote…
add a comment here for what the boolean argument is
Done.
pkg/util/encoding/encoding.go, line 1555 at r3 (raw file):
Previously, mjibson (Matt Jibson) wrote…
You have not verified that
len(b) >= 8
here, so this could panic.
Done.
pkg/geo/geo.go, line 309 at r4 (raw file): Previously, otan (Oliver Tan) wrote…
That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encoding stuff LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 9 files at r1, 1 of 5 files at r2, 1 of 2 files at r3, 3 of 4 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)
pkg/geo/geo.go, line 309 at r4 (raw file):
Previously, otan (Oliver Tan) wrote…
That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable?
let's write our own inverse function using uint64
-- the algorithm is well documented https://en.wikipedia.org/wiki/Hilbert_curve#Applications_and_mapping_algorithms
pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):
} n := len(b) b = encodeBytesAscendingWithTerminator(b, data, ascendingGeoEscapes.escapedTerm)
How about adding a comment
// These bytes don't provide a semantically meaningful order, unlike the curveIndex, so we encode these in ascending order even though this is the EncodeGeoDescending
function.
Or maybe this was meant to be descending since I see DecodeGeoDescending()
doing descendingGeoEscapes
. If this was a typo, can you add a test that would catch it.
pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):
that inverses this. I can add a comment here, but tests do catch it. |
pkg/geo/geo.go, line 309 at r4 (raw file): Previously, sumeerbhola wrote…
Something I'm not quite understanding -- is there a reason to? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)
pkg/geo/geo.go, line 309 at r4 (raw file):
Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough
It is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.
Regarding the actual algorithm, I'd rather go with the common one.
78976f7
to
2d2dd6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)
pkg/geo/geo.go, line 309 at r4 (raw file):
Previously, sumeerbhola wrote…
Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enoughIt is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.
Regarding the actual algorithm, I'd rather go with the common one.
Done.
pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
These is a line below:
onesComplement(b[n:])
that inverses this. I can add a comment here, but tests do catch it.
Done.
pkg/util/encoding/encoding_test.go, line 1208 at r5 (raw file):
Previously, mjibson (Matt Jibson) wrote…
This breaks subtests since you can no longer run just one. Is it annoying to use
testCases[i-1]
instead? I don't feel strongly, leave it as is if you want.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)
pkg/geo/geo.go, line 309 at r7 (raw file):
yBounds := (bounds.MaxY - bounds.MinY) + 1 // We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
s/two/to/
The logic in this function seems a bit confusing and I am not sure about its correctness:
-
it isn't quite obvious why the
xBounds
andyBounds
equation above ensures that thexPos,yPos
computed later will be< N
. IfxBounds
is a power of 2 and((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX))
is 1, thenxPos=N
I think. -
If the bounds are coming from the SRID, I don't see why they can't exceed the size of
uint32
which is undesirable since thebiggestBoxLength
could exceeduint32
.
I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.
const boxLength = 1 << 32
// Add 1 to each bound so that we normalize the coordinates to [0, 1) before
// multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1].
xBounds := (bounds.MaxX - bounds.MinX) + 1
yBounds := (bounds.MaxY - bounds.MinY) + 1
xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
return hilbertInverse(boxLength, xPos, yPos)
pkg/geo/geo_test.go, line 565 at r7 (raw file):
} func TestGeographyHash(t *testing.T) {
why hash?
pkg/util/encoding/encoding.go, line 982 at r7 (raw file):
// returns the new buffer. // It is sorted by the given curve index, followed by the bytes of the spatial object. func EncodeGeoAscending(b []byte, curveIndex uint64, g *geopb.SpatialObject) ([]byte, error) {
is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)
pkg/geo/geo.go, line 309 at r7 (raw file):
Previously, sumeerbhola wrote…
s/two/to/
The logic in this function seems a bit confusing and I am not sure about its correctness:
it isn't quite obvious why the
xBounds
andyBounds
equation above ensures that thexPos,yPos
computed later will be< N
. IfxBounds
is a power of 2 and((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX))
is 1, thenxPos=N
I think.If the bounds are coming from the SRID, I don't see why they can't exceed the size of
uint32
which is undesirable since thebiggestBoxLength
could exceeduint32
.I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.
const boxLength = 1 << 32 // Add 1 to each bound so that we normalize the coordinates to [0, 1) before // multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1]. xBounds := (bounds.MaxX - bounds.MinX) + 1 yBounds := (bounds.MaxY - bounds.MinY) + 1 xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength) yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength) return hilbertInverse(boxLength, xPos, yPos)
Done.
pkg/geo/geo_test.go, line 565 at r7 (raw file):
Previously, sumeerbhola wrote…
why hash?
forgot to rename it.
pkg/util/encoding/encoding.go, line 982 at r7 (raw file):
Previously, sumeerbhola wrote…
is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?
No, that is encoded elsewhere. This is specifically for key encodings which require ordering (indexes & PKs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)
pkg/geo/geo.go, line 311 at r8 (raw file):
yBounds := (bounds.MaxY - bounds.MinY) + 1 xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength) yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].
pkg/geo/geo_test.go, line 641 at r8 (raw file):
"LINESTRING(0 0, -90 -80)", "POINT(-80 80)", "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",
Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)
pkg/geo/geo.go, line 311 at r8 (raw file):
Previously, sumeerbhola wrote…
// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].
Done.
pkg/geo/geo_test.go, line 641 at r8 (raw file):
Previously, sumeerbhola wrote…
Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?
Done.
This follows the [PostGIS way](https://info.crunchydata.com/blog/waiting-for-postgis-3-hilbert-geometry-sorting), but does not follow the same encoding mechanism, so we cannot compare results directly. For Geography we re-use the S2 infrastructure and for Geometry we use the google/hilbert library. Release note (sql change): When ordering by geospatial columns, it will now order by the Hilbert Space Curve index so that points which are geographically similar are clustered together.
bors r=sumeerbhola |
Canceled. |
bors r=sumeerbhola |
Build succeeded: |
This follows the PostGIS way,
but does not follow the same encoding.
Visualisation for Geometry for 10000 random points:
Visualisation for Geography 10000 random points:
Release note (sql change): When ordering by geospatial columns, it will
now order by the Hilbert Space Curve index so that points which are
geographically similar are clustered together.