Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geo: order geospatial objects by Hilbert Curve index #51898

Merged
merged 1 commit into from
Aug 3, 2020

Conversation

otan
Copy link
Contributor

@otan otan commented Jul 25, 2020

This follows the PostGIS way,
but does not follow the same encoding.

Visualisation for Geometry for 10000 random points:
image

Visualisation for Geography 10000 random points:
image

Release note (sql change): When ordering by geospatial columns, it will
now order by the Hilbert Space Curve index so that points which are
geographically similar are clustered together.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@otan otan force-pushed the encode_key_prop branch 2 times, most recently from 5a70cdf to c751ffa Compare July 27, 2020 14:35
@otan otan marked this pull request as ready for review July 27, 2020 14:35
@otan otan requested review from sumeerbhola and a team July 27, 2020 14:35
Copy link
Contributor

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now CompareSpatialObject and SpaceCurveIndex. It is then up to callers in other packages that they need to call both in a specific order. Should we instead make a single function in this package that does the correct thing? If not, we should document somewhere the relationship between those two functions and how they should be used when comparing objects.

pkg/geo/geo.go Outdated Show resolved Hide resolved
pkg/geo/geo.go Outdated Show resolved Hide resolved
@otan otan force-pushed the encode_key_prop branch from c751ffa to 41820b7 Compare July 27, 2020 16:42
Copy link
Contributor

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to remove the geo exceptions in sqlbase/encoded_datum_test.go so that TestEncDatumCompare tests these now.

pkg/util/encoding/encoding.go Outdated Show resolved Hide resolved
pkg/util/encoding/encoding.go Outdated Show resolved Hide resolved
@otan otan force-pushed the encode_key_prop branch from 41820b7 to 0f69c70 Compare July 27, 2020 17:20
@maddyblue
Copy link
Contributor

Also someone else should review the implementation of the new geo stuff here. I'm not qualified.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)


pkg/geo/geo.go, line 281 at r4 (raw file):

}

// SpaceCurveIndex returns an uint64 index to use representing an index into a Space Curve.

nit: ... an index into a Space-filling Curve.


pkg/geo/geo.go, line 291 at r4 (raw file):

	centerX := (bbox.BoundingBox.LoX + bbox.BoundingBox.HiX) / 2
	centerY := (bbox.BoundingBox.LoY + bbox.BoundingBox.HiY) / 2
	// By default, bound by MaxFloat32.

The comment says MaxFloat32 but the code is doing MaxInt32.


pkg/geo/geo.go, line 307 at r4 (raw file):

	// Find the longest dimension, add one to be inclusive.
	// TODO(otan): investigate storing this calculation on the projection itself for perf.

How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)


pkg/geo/geo.go, line 309 at r4 (raw file):

	// TODO(otan): investigate storing this calculation on the projection itself for perf.
	biggestBoxLength := math.Max(bounds.MaxX-bounds.MinX, bounds.MaxY-bounds.MinY) + 1
	boxSize := 1 << 40

what is the reason for this value 1 << 40 versus some other power of 2? Won't this overflow when doing (2^40)^2-1?

It is not clear why we are trying to compute a variable box size since we normalize to [0, 1] below before multiplying by the box size. Given the hilbert library is using signed int, it seems simpler to use a constant box size of 2^15, so that we get a maximum return value of 2^30-1 (I didn't look at the implementation of the library to see what it does wrt overflow) which gives us the most ability to distinguish between different centers.

@otan
Copy link
Contributor Author

otan commented Jul 27, 2020


pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

what is the reason for this value 1 << 40 versus some other power of 2? Won't this overflow when doing (2^40)^2-1?

It is not clear why we are trying to compute a variable box size since we normalize to [0, 1] below before multiplying by the box size. Given the hilbert library is using signed int, it seems simpler to use a constant box size of 2^15, so that we get a maximum return value of 2^30-1 (I didn't look at the implementation of the library to see what it does wrt overflow) which gives us the most ability to distinguish between different centers.

A signed int in Golang is int64 on 64-bit systems.
Using a constant size of 15 is not great -- this is the bounding.

image.png

Copy link
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)


pkg/geo/geo.go, line 540 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

More comments, as above.

Done.


pkg/geo/geo.go, line 291 at r4 (raw file):

Previously, sumeerbhola wrote…

The comment says MaxFloat32 but the code is doing MaxInt32.

Done.


pkg/geo/geo.go, line 307 at r4 (raw file):

Previously, sumeerbhola wrote…

How about a comment like:
// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].
// The hilbert curve value will be in the interval [0, N^2-1]. The x, y coordinates will be first normalized
// below to the interval [0, 1].
(the rest of the comment is a TODO, depending on the questions below)

Done.


pkg/util/encoding/encoding.go, line 1029 at r2 (raw file):

Previously, mjibson (Matt Jibson) wrote…

add a comment here for what the boolean argument is

Done.


pkg/util/encoding/encoding.go, line 1052 at r2 (raw file):

Previously, mjibson (Matt Jibson) wrote…

add a comment here for what the boolean argument is

Done.


pkg/util/encoding/encoding.go, line 1555 at r3 (raw file):

Previously, mjibson (Matt Jibson) wrote…

You have not verified that len(b) >= 8 here, so this could panic.

Done.

@otan otan force-pushed the encode_key_prop branch from 6252e1f to aef51d7 Compare July 27, 2020 20:56
@otan
Copy link
Contributor Author

otan commented Jul 27, 2020


pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, otan (Oliver Tan) wrote…

A signed int in Golang is int64 on 64-bit systems.
Using a constant size of 15 is not great -- this is the bounding.

image.png

That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable?

Copy link
Contributor

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoding stuff LGTM.

pkg/util/encoding/encoding_test.go Outdated Show resolved Hide resolved
@otan otan force-pushed the encode_key_prop branch from aef51d7 to b56e308 Compare July 27, 2020 22:04
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 9 files at r1, 1 of 5 files at r2, 1 of 2 files at r3, 3 of 4 files at r5, 1 of 1 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)


pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, otan (Oliver Tan) wrote…

That said, this is for a 10000x10000 cell square for SRID 3857 (-20037508.342789244 to 20037508.342789244), so maybe this is acceptable?

let's write our own inverse function using uint64 -- the algorithm is well documented https://en.wikipedia.org/wiki/Hilbert_curve#Applications_and_mapping_algorithms


pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):

	}
	n := len(b)
	b = encodeBytesAscendingWithTerminator(b, data, ascendingGeoEscapes.escapedTerm)

How about adding a comment
// These bytes don't provide a semantically meaningful order, unlike the curveIndex, so we encode these in ascending order even though this is the EncodeGeoDescending function.

Or maybe this was meant to be descending since I see DecodeGeoDescending() doing descendingGeoEscapes. If this was a typo, can you add a test that would catch it.

@otan
Copy link
Contributor Author

otan commented Jul 28, 2020


pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):
These is a line below:

onesComplement(b[n:])

that inverses this. I can add a comment here, but tests do catch it.

@otan
Copy link
Contributor Author

otan commented Jul 28, 2020


pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

let's write our own inverse function using uint64 -- the algorithm is well documented https://en.wikipedia.org/wiki/Hilbert_curve#Applications_and_mapping_algorithms

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough
Also if we're going to write our own, we could use https://github.com/rawrunprotected/hilbert_curves/ which is advertised as log(n), but I'm not sure I understand the math.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)


pkg/geo/geo.go, line 309 at r4 (raw file):

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough

It is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.

Regarding the actual algorithm, I'd rather go with the common one.

@otan otan force-pushed the encode_key_prop branch 4 times, most recently from 78976f7 to 2d2dd6e Compare July 28, 2020 19:59
Copy link
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)


pkg/geo/geo.go, line 309 at r4 (raw file):

Previously, sumeerbhola wrote…

Something I'm not quite understanding -- is there a reason to?
I think this one is fine; it's just with a small resolution (i.e. 15), it doesn't go deep enough

It is a reasonable question. My opinion is that writing those 15 lines of code is faster than trying to debate or think through all the cons of the tradeoff we are making by limiting ourselves to 2^15 resolution for each coordinate. And your visualization with the intersecting lines in the lower resolution setting look quite compellingly bad.

Regarding the actual algorithm, I'd rather go with the common one.

Done.


pkg/util/encoding/encoding.go, line 1006 at r6 (raw file):

Previously, otan (Oliver Tan) wrote…

These is a line below:

onesComplement(b[n:])

that inverses this. I can add a comment here, but tests do catch it.

Done.


pkg/util/encoding/encoding_test.go, line 1208 at r5 (raw file):

Previously, mjibson (Matt Jibson) wrote…

This breaks subtests since you can no longer run just one. Is it annoying to use testCases[i-1] instead? I don't feel strongly, leave it as is if you want.

Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)


pkg/geo/geo.go, line 309 at r7 (raw file):

	yBounds := (bounds.MaxY - bounds.MinY) + 1

	// We need two compute N such that N is a power of 2 and the x, y coordinates are in [0, N-1].

s/two/to/

The logic in this function seems a bit confusing and I am not sure about its correctness:

  • it isn't quite obvious why the xBounds and yBounds equation above ensures that the xPos,yPos computed later will be < N. If xBounds is a power of 2 and ((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX)) is 1, then xPos=N I think.

  • If the bounds are coming from the SRID, I don't see why they can't exceed the size of uint32 which is undesirable since the biggestBoxLength could exceed uint32.

I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.

	const boxLength = 1 << 32
	// Add 1 to each bound so that we normalize the coordinates to [0, 1) before
	// multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1].
	xBounds := (bounds.MaxX - bounds.MinX) + 1
	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
        return hilbertInverse(boxLength, xPos, yPos)

pkg/geo/geo_test.go, line 565 at r7 (raw file):

}

func TestGeographyHash(t *testing.T) {

why hash?


pkg/util/encoding/encoding.go, line 982 at r7 (raw file):

// returns the new buffer.
// It is sorted by the given curve index, followed by the bytes of the spatial object.
func EncodeGeoAscending(b []byte, curveIndex uint64, g *geopb.SpatialObject) ([]byte, error) {

is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?

@otan otan force-pushed the encode_key_prop branch from 2d2dd6e to 0e13d3e Compare July 29, 2020 20:17
Copy link
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)


pkg/geo/geo.go, line 309 at r7 (raw file):

Previously, sumeerbhola wrote…

s/two/to/

The logic in this function seems a bit confusing and I am not sure about its correctness:

  • it isn't quite obvious why the xBounds and yBounds equation above ensures that the xPos,yPos computed later will be < N. If xBounds is a power of 2 and ((centerX - bounds.MinX) / (bounds.MaxX - bounds.MinX)) is 1, then xPos=N I think.

  • If the bounds are coming from the SRID, I don't see why they can't exceed the size of uint32 which is undesirable since the biggestBoxLength could exceed uint32.

I think something like the following would work since it uses the bounds only for normalizing to [0, 1). Once that normalization is done, we can use a constant sized square for the hilbert curve with a side of 2^32.

	const boxLength = 1 << 32
	// Add 1 to each bound so that we normalize the coordinates to [0, 1) before
	// multiplying by boxLength to give coordinates that are integers in the interval [0, boxLength-1].
	xBounds := (bounds.MaxX - bounds.MinX) + 1
	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)
        return hilbertInverse(boxLength, xPos, yPos)

Done.


pkg/geo/geo_test.go, line 565 at r7 (raw file):

Previously, sumeerbhola wrote…

why hash?

forgot to rename it.


pkg/util/encoding/encoding.go, line 982 at r7 (raw file):

Previously, sumeerbhola wrote…

is this function used for encoding a geometry column even when it is not part of a key (when the ordering does not matter)?

No, that is encoded elsewhere. This is specifically for key encodings which require ordering (indexes & PKs)

@otan otan force-pushed the encode_key_prop branch from 0e13d3e to 03a7a37 Compare July 30, 2020 20:47
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mjibson, @otan, and @sumeerbhola)


pkg/geo/geo.go, line 311 at r8 (raw file):

	yBounds := (bounds.MaxY - bounds.MinY) + 1
	xPos := uint64(((centerX - bounds.MinX) / xBounds) * boxLength)
	yPos := uint64(((centerY - bounds.MinY) / yBounds) * boxLength)

// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].


pkg/geo/geo_test.go, line 641 at r8 (raw file):

				"LINESTRING(0 0, -90 -80)",
				"POINT(-80 80)",
				"POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",

Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?

Copy link
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @sumeerbhola)


pkg/geo/geo.go, line 311 at r8 (raw file):

Previously, sumeerbhola wrote…

// hilbertInverse returns values in the interval [0, boxLength^2-1], so [0, 2^64-1].

Done.


pkg/geo/geo_test.go, line 641 at r8 (raw file):

Previously, sumeerbhola wrote…

Do you have shapes here that exceed the SRID bounds, and another that is at the bounds, so results in an index of 2^64 -1? Could you add a couple of such cases with assertions on the actual index values (not just relative comparisons)?

Done.

@otan otan force-pushed the encode_key_prop branch from 03a7a37 to be0ed56 Compare August 3, 2020 17:27
This follows the [PostGIS way](https://info.crunchydata.com/blog/waiting-for-postgis-3-hilbert-geometry-sorting),
but does not follow the same encoding mechanism, so we cannot compare
results directly.

For Geography we re-use the S2 infrastructure and for Geometry we use
the google/hilbert library.

Release note (sql change): When ordering by geospatial columns, it will
now order by the Hilbert Space Curve index so that points which are
geographically similar are clustered together.
@otan
Copy link
Contributor Author

otan commented Aug 3, 2020

bors r=sumeerbhola

@otan otan force-pushed the encode_key_prop branch from be0ed56 to 819e0c3 Compare August 3, 2020 18:38
@craig
Copy link
Contributor

craig bot commented Aug 3, 2020

Canceled.

@otan
Copy link
Contributor Author

otan commented Aug 3, 2020

bors r=sumeerbhola

@craig
Copy link
Contributor

craig bot commented Aug 3, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants