geo: order geospatial objects by Hilbert Curve index #51898

otan · 2020-07-25T05:18:34Z

This follows the PostGIS way,
but does not follow the same encoding.

Visualisation for Geometry for 10000 random points:

Visualisation for Geography 10000 random points:

Release note (sql change): When ordering by geospatial columns, it will
now order by the Hilbert Space Curve index so that points which are
geographically similar are clustered together.

cockroach-teamcity · 2020-07-25T05:18:41Z

This change is

maddyblue

There's now CompareSpatialObject and SpaceCurveIndex. It is then up to callers in other packages that they need to call both in a specific order. Should we instead make a single function in this package that does the correct thing? If not, we should document somewhere the relationship between those two functions and how they should be used when comparing objects.

pkg/geo/geo.go

maddyblue

Should be able to remove the geo exceptions in sqlbase/encoded_datum_test.go so that TestEncDatumCompare tests these now.

pkg/util/encoding/encoding.go

maddyblue · 2020-07-27T19:22:16Z

Also someone else should review the implementation of the new geo stuff here. I'm not qualified.

sumeerbhola

Reviewed 3 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)

pkg/geo/geo.go, line 281 at r4 (raw file):

}

// SpaceCurveIndex returns an uint64 index to use representing an index into a Space Curve.

nit: ... an index into a Space-filling Curve.

pkg/geo/geo.go, line 291 at r4 (raw file):

	centerX := (bbox.BoundingBox.LoX + bbox.BoundingBox.HiX) / 2
	centerY := (bbox.BoundingBox.LoY + bbox.BoundingBox.HiY) / 2
	// By default, bound by MaxFloat32.

The comment says MaxFloat32 but the code is doing MaxInt32.

pkg/geo/geo.go, line 307 at r4 (raw file):

	// Find the longest dimension, add one to be inclusive.
	// TODO(otan): investigate storing this calculation on the projection itself for perf.

How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)

pkg/geo/geo.go, line 309 at r4 (raw file):

	// TODO(otan): investigate storing this calculation on the projection itself for perf.
	biggestBoxLength := math.Max(bounds.MaxX-bounds.MinX, bounds.MaxY-bounds.MinY) + 1
	boxSize := 1 << 40

what is the reason for this value 1 << 40 versus some other power of 2? Won't this overflow when doing (2^40)^2-1?

It is not clear why we are trying to compute a variable box size since we normalize to [0, 1] below before multiplying by the box size. Given the hilbert library is using signed int, it seems simpler to use a constant box size of 2^15, so that we get a maximum return value of 2^30-1 (I didn't look at the implementation of the library to see what it does wrt overflow) which gives us the most ability to distinguish between different centers.

otan · 2020-07-27T20:49:36Z

pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

what is the reason for this value 1 << 40 versus some other power of 2? Won't this overflow when doing (2^40)^2-1?

It is not clear why we are trying to compute a variable box size since we normalize to [0, 1] below before multiplying by the box size. Given the hilbert library is using signed int, it seems simpler to use a constant box size of 2^15, so that we get a maximum return value of 2^30-1 (I didn't look at the implementation of the library to see what it does wrt overflow) which gives us the most ability to distinguish between different centers.

A signed int in Golang is int64 on 64-bit systems.
Using a constant size of 15 is not great -- this is the bounding.

otan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)

pkg/geo/geo.go, line 540 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

More comments, as above.

Done.

pkg/geo/geo.go, line 291 at r4 (raw file):

Previously, sumeerbhola wrote…

The comment says MaxFloat32 but the code is doing MaxInt32.

Done.

pkg/geo/geo.go, line 307 at r4 (raw file):

Previously, sumeerbhola wrote…

How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)

Done.

pkg/util/encoding/encoding.go, line 1029 at r2 (raw file):

Previously, mjibson (Matt Jibson) wrote…

add a comment here for what the boolean argument is

Done.

pkg/util/encoding/encoding.go, line 1052 at r2 (raw file):

Previously, mjibson (Matt Jibson) wrote…

add a comment here for what the boolean argument is

Done.

pkg/util/encoding/encoding.go, line 1555 at r3 (raw file):

Previously, mjibson (Matt Jibson) wrote…

You have not verified that len(b) >= 8 here, so this could panic.

Done.

otan · 2020-07-27T21:01:36Z

pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, otan (Oliver Tan) wrote…

A signed int in Golang is int64 on 64-bit systems.
Using a constant size of 15 is not great -- this is the bounding.

That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable?

maddyblue

Encoding stuff LGTM.

pkg/util/encoding/encoding_test.go

sumeerbhola

Reviewed 1 of 9 files at r1, 1 of 5 files at r2, 1 of 2 files at r3, 3 of 4 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)

pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, otan (Oliver Tan) wrote…

That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable?

let's write our own inverse function using uint64 -- the algorithm is well documented https://en.wikipedia.org/wiki/Hilbert_curve#Applications_and_mapping_algorithms

pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):

	}
	n := len(b)
	b = encodeBytesAscendingWithTerminator(b, data, ascendingGeoEscapes.escapedTerm)

How about adding a comment
// These bytes don't provide a semantically meaningful order, unlike the curveIndex, so we encode these in ascending order even though this is the EncodeGeoDescending function.

Or maybe this was meant to be descending since I see DecodeGeoDescending() doing descendingGeoEscapes. If this was a typo, can you add a test that would catch it.

otan · 2020-07-28T00:49:04Z

pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):
These is a line below:

onesComplement(b[n:])

that inverses this. I can add a comment here, but tests do catch it.

otan · 2020-07-28T00:51:23Z

pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

let's write our own inverse function using uint64 -- the algorithm is well documented https://en.wikipedia.org/wiki/Hilbert_curve#Applications_and_mapping_algorithms

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough
Also if we're going to write our own, we could use https://github.com/rawrunprotected/hilbert_curves/ which is advertised as log(n), but I'm not sure I understand the math.

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)

pkg/geo/geo.go, line 309 at r4 (raw file):

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough

It is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.

Regarding the actual algorithm, I'd rather go with the common one.

otan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)

pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough

It is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.

Regarding the actual algorithm, I'd rather go with the common one.

Done.

pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):

Previously, otan (Oliver Tan) wrote…

These is a line below:

onesComplement(b[n:])

that inverses this. I can add a comment here, but tests do catch it.

Done.

pkg/util/encoding/encoding_test.go, line 1208 at r5 (raw file):

Previously, mjibson (Matt Jibson) wrote…

This breaks subtests since you can no longer run just one. Is it annoying to use testCases[i-1] instead? I don't feel strongly, leave it as is if you want.

Done.

sumeerbhola

Reviewed 7 of 7 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)

pkg/geo/geo.go, line 309 at r7 (raw file):

	yBounds := (bounds.MaxY - bounds.MinY) + 1

	// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].

s/two/to/

The logic in this function seems a bit confusing and I am not sure about its correctness:

it isn't quite obvious why the xBounds and yBounds equation above ensures that the xPos,yPos computed later will be < N. If xBounds is a power of 2 and ((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX)) is 1, then xPos=N I think.
If the bounds are coming from the SRID, I don't see why they can't exceed the size of uint32 which is undesirable since the biggestBoxLength could exceed uint32.

I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.

	const boxLength = 1 << 32
	// Add 1 to each bound so that we normalize the coordinates to [0, 1) before
	// multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1].
	xBounds := (bounds.MaxX - bounds.MinX) + 1
	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
        return hilbertInverse(boxLength, xPos, yPos)

pkg/geo/geo_test.go, line 565 at r7 (raw file):

}

func TestGeographyHash(t *testing.T) {

why hash?

pkg/util/encoding/encoding.go, line 982 at r7 (raw file):

// returns the new buffer.
// It is sorted by the given curve index, followed by the bytes of the spatial object.
func EncodeGeoAscending(b []byte, curveIndex uint64, g *geopb.SpatialObject) ([]byte, error) {

is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?

otan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)

pkg/geo/geo.go, line 309 at r7 (raw file):

Previously, sumeerbhola wrote…

s/two/to/

The logic in this function seems a bit confusing and I am not sure about its correctness:

it isn't quite obvious why the xBounds and yBounds equation above ensures that the xPos,yPos computed later will be < N. If xBounds is a power of 2 and ((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX)) is 1, then xPos=N I think.

If the bounds are coming from the SRID, I don't see why they can't exceed the size of uint32 which is undesirable since the biggestBoxLength could exceed uint32.

I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.
	const boxLength = 1 << 32
	// Add 1 to each bound so that we normalize the coordinates to [0, 1) before
	// multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1].
	xBounds := (bounds.MaxX - bounds.MinX) + 1
	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
        return hilbertInverse(boxLength, xPos, yPos)

Done.

pkg/geo/geo_test.go, line 565 at r7 (raw file):

Previously, sumeerbhola wrote…

why hash?

forgot to rename it.

pkg/util/encoding/encoding.go, line 982 at r7 (raw file):

Previously, sumeerbhola wrote…

is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?

No, that is encoded elsewhere. This is specifically for key encodings which require ordering (indexes & PKs)

sumeerbhola

Reviewed 5 of 5 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)

pkg/geo/geo.go, line 311 at r8 (raw file):

	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)

// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].

pkg/geo/geo_test.go, line 641 at r8 (raw file):

				"LINESTRING(0 0, -90 -80)",
				"POINT(-80 80)",
				"POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",

Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?

otan

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)

pkg/geo/geo.go, line 311 at r8 (raw file):

Previously, sumeerbhola wrote…

// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].

Done.

pkg/geo/geo_test.go, line 641 at r8 (raw file):

Previously, sumeerbhola wrote…

Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?

Done.

This follows the [PostGIS way](https://info.crunchydata.com/blog/waiting-for-postgis-3-hilbert-geometry-sorting), but does not follow the same encoding mechanism, so we cannot compare results directly. For Geography we re-use the S2 infrastructure and for Geometry we use the google/hilbert library. Release note (sql change): When ordering by geospatial columns, it will now order by the Hilbert Space Curve index so that points which are geographically similar are clustered together.

otan · 2020-08-03T18:38:13Z

bors r=sumeerbhola

craig · 2020-08-03T18:38:27Z

Canceled.

otan · 2020-08-03T18:38:35Z

bors r=sumeerbhola

craig · 2020-08-03T19:49:51Z

Build succeeded:

GitHub CI (Cockroach)

otan force-pushed the encode_key_prop branch 2 times, most recently from 5a70cdf to c751ffa Compare July 27, 2020 14:35

otan marked this pull request as ready for review July 27, 2020 14:35

otan requested review from sumeerbhola and a team July 27, 2020 14:35

maddyblue reviewed Jul 27, 2020

View reviewed changes

pkg/geo/geo.go Outdated Show resolved Hide resolved

pkg/geo/geo.go Outdated Show resolved Hide resolved

otan force-pushed the encode_key_prop branch from c751ffa to 41820b7 Compare July 27, 2020 16:42

maddyblue reviewed Jul 27, 2020

View reviewed changes

pkg/util/encoding/encoding.go Outdated Show resolved Hide resolved

pkg/util/encoding/encoding.go Outdated Show resolved Hide resolved

otan force-pushed the encode_key_prop branch from 41820b7 to 0f69c70 Compare July 27, 2020 17:20

maddyblue reviewed Jul 27, 2020

View reviewed changes

pkg/util/encoding/encoding.go Show resolved Hide resolved

sumeerbhola requested changes Jul 27, 2020

View reviewed changes

otan commented Jul 27, 2020

View reviewed changes

otan force-pushed the encode_key_prop branch from 6252e1f to aef51d7 Compare July 27, 2020 20:56

maddyblue approved these changes Jul 27, 2020

View reviewed changes

pkg/util/encoding/encoding_test.go Outdated Show resolved Hide resolved

otan force-pushed the encode_key_prop branch from aef51d7 to b56e308 Compare July 27, 2020 22:04

sumeerbhola requested changes Jul 28, 2020

View reviewed changes

sumeerbhola reviewed Jul 28, 2020

View reviewed changes

otan force-pushed the encode_key_prop branch 4 times, most recently from 78976f7 to 2d2dd6e Compare July 28, 2020 19:59

otan commented Jul 28, 2020

View reviewed changes

sumeerbhola requested changes Jul 29, 2020

View reviewed changes

otan force-pushed the encode_key_prop branch from 2d2dd6e to 0e13d3e Compare July 29, 2020 20:17

otan commented Jul 29, 2020

View reviewed changes

otan force-pushed the encode_key_prop branch from 0e13d3e to 03a7a37 Compare July 30, 2020 20:47

sumeerbhola approved these changes Aug 3, 2020

View reviewed changes

otan commented Aug 3, 2020

View reviewed changes

otan force-pushed the encode_key_prop branch from 03a7a37 to be0ed56 Compare August 3, 2020 17:27

otan force-pushed the encode_key_prop branch from be0ed56 to 819e0c3 Compare August 3, 2020 18:38

craig bot merged commit 66a3c19 into cockroachdb:master Aug 3, 2020

lnhsingh mentioned this pull request Aug 26, 2020

geo: order geospatial objects by Hilbert Curve index cockroachdb/docs#8182

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

geo: order geospatial objects by Hilbert Curve index #51898

geo: order geospatial objects by Hilbert Curve index #51898

otan commented Jul 25, 2020 •

edited

Loading

cockroach-teamcity commented Jul 25, 2020

maddyblue left a comment

maddyblue left a comment

maddyblue commented Jul 27, 2020

sumeerbhola left a comment

otan commented Jul 27, 2020

otan left a comment

otan commented Jul 27, 2020

maddyblue left a comment

sumeerbhola left a comment

otan commented Jul 28, 2020

otan commented Jul 28, 2020

sumeerbhola left a comment

otan left a comment

sumeerbhola left a comment

otan left a comment

sumeerbhola left a comment

otan left a comment

otan commented Aug 3, 2020

craig bot commented Aug 3, 2020

otan commented Aug 3, 2020

craig bot commented Aug 3, 2020

geo: order geospatial objects by Hilbert Curve index #51898

geo: order geospatial objects by Hilbert Curve index #51898

Conversation

otan commented Jul 25, 2020 • edited Loading

cockroach-teamcity commented Jul 25, 2020

maddyblue left a comment

Choose a reason for hiding this comment

maddyblue left a comment

Choose a reason for hiding this comment

maddyblue commented Jul 27, 2020

sumeerbhola left a comment

Choose a reason for hiding this comment

otan commented Jul 27, 2020

otan left a comment

Choose a reason for hiding this comment

otan commented Jul 27, 2020

maddyblue left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

otan commented Jul 28, 2020

otan commented Jul 28, 2020

sumeerbhola left a comment

Choose a reason for hiding this comment

otan left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

otan left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

otan left a comment

Choose a reason for hiding this comment

otan commented Aug 3, 2020

craig bot commented Aug 3, 2020

otan commented Aug 3, 2020

craig bot commented Aug 3, 2020

otan commented Jul 25, 2020 •

edited

Loading