Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update centroid calculation of geo_shape for geo-centroid #49887

Closed
talevy opened this issue Dec 5, 2019 · 7 comments
Closed

Update centroid calculation of geo_shape for geo-centroid #49887

talevy opened this issue Dec 5, 2019 · 7 comments
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes

Comments

@talevy
Copy link
Contributor

talevy commented Dec 5, 2019

where things are now

As the geoshape-doc-values branch stands today, the centroid is
defined as the centroid of all the geometry's vertices.

Why the update?

After a re-visit of this strategy, we decided that it would be worth
modifying this calculation to be closer in line with how popular
geo-spatial data-stores compute the centroid.

After exploring existing implementations for some of these, it seems there are a few commonalities. I'll list in the detailed definition below.

Proposed detailed definition of a centroid of a geo_shape field

Geometry Type Centroid Calculation
[Multi]Point equally weighted average of all the coordinates
[Multi]LineString a weighted average of all the centroids of each segment, where the weight of each segment is its length in degrees
[Multi]Polygon a weighted average of all the centroids of all the triangles of a polygon where the triangles are formed by every two consecutive vertices and the starting-point. holes have negative weights. weights represent the area of the triangle in deg^2 calculated using The Shoelace Formulate
GeometryCollection The centroid of all the underlying geometries with the highest dimension. If Polygons and Lines and/or Points, then lines and/or points are ignored. If Lines and Points, then points are ignored

Proposed definition of a geo-centroid of a bucket

The un-weighted centroid of the centroid of each shape within the bucket.

Below is a summary of the descriptions from various sources:

  • OGC Standard

"No specific algorithm is specified for the Centroid function; answers may vary with implementation."

- OpenGIS® Implementation Standard for Geographic information - Simple feature access - Part 2: SQL option

The OGC standard does not specify any specific way one
should calculate the centroid.

Returns a point geometry that is the centroid of a polygon, multipolygon, point, or point cluster. (The centroid is also known as the "center of gravity.")

For an input geometry consisting of multiple objects, the result is weighted by the area of each polygon in the geometry objects. If the geometry objects are a mixture of polygons and points, the points are not used in the calculation of the centroid. If the geometry objects are all points, the points have equal weight.

The Oracle definition does not specify how one should implement and define weight.
It interestingly also does not describe the behavior for line-strings.

Computes the geometric center of a geometry, or equivalently, the center of mass of the geometry as a POINT. For [MULTI]POINTs, this is computed as the arithmetic mean of the input coordinates. For [MULTI]LINESTRINGs, this is computed as the weighted length of each line segment. For [MULTI]POLYGONs, "weight" is thought in terms of area. If an empty geometry is supplied, an empty GEOMETRYCOLLECTION is returned. If NULL is supplied, NULL is returned. If CIRCULARSTRING or COMPOUNDCURVE are supplied, they are converted to linestring wtih CurveToLine first, then same than for LINESTRING

The centroid is equal to the centroid of the set of component Geometries of highest dimension

Similarly, this description does not go into detail about how either the "center of mass" or the "weighted length" is calculated.

Discussion Item

  • Is this new definition of centroid the one we would like?
    For example, there can be multiple interpretations of how
    the centroid as an aggregate function should be calculated.
    These specs define ST_Centroid as a simple function on one
    field value. Geo Centroid aggregation is the centroid of a field
    in multiple documents. Should all these field values be treated as
    a singular GeometryCollection which the centroid is derived from,
    or should we treat each document as a singular point (computed using above)
    with equal weight.
  • should weight be measured in degrees and degrees^2, or meters and meters^2?
@talevy talevy added the :Analytics/Geo Indexing, search aggregations of geo points and shapes label Dec 5, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

@talevy
Copy link
Contributor Author

talevy commented Dec 6, 2019

Here is a proposed WIP implementation of above: talevy@a713f5c

Updating the CentroidCalculator seemed pretty straightforward. The one extra requirement that has yet to be tackled is how to determine which types of shapes exist in the geometry. There are two ideas I have there: one requires yet another iteration of the geometry, another requires more space/variables for each dimension.

@talevy
Copy link
Contributor Author

talevy commented Dec 11, 2019

We discussed this in our analytics/geo team meeting and concluded that:

  • the scope of this centroid definition is to the geo_centroid aggregation, not necessarily all definitions of centroid in other future contexts
  • the centroid of a bucket is that of all geometries that reside in that bucket treaded as one geometry-collection. not a centroid of centroids of shapes
  • center of mass calculations as described will be implemented in degrees and degrees^2

talevy added a commit to talevy/elasticsearch that referenced this issue Dec 11, 2019
This commit serializes the ShapeType of the indexed
geometry. The ShapeType can be useful for other future
features. For one thing: elastic#49887 depends on the ability
to determine what the highest dimensional shape is
for centroid calculations.

GeometryCollection is reduced to the sub-shape of the
higest dimension

relates elastic#37206.
@talevy
Copy link
Contributor Author

talevy commented Dec 11, 2019

Information about a shape's dimension for centroid calculation is introduced in #50104.

@jpountz
Copy link
Contributor

jpountz commented Dec 12, 2019

the centroid of a bucket is that of all geometries that reside in that bucket treaded as one geometry-collection. not a centroid of centroids of shapes

+1 this is also more consistent with the current behavior of the centroid aggregation on points in the case of a multi-valued field

@talevy
Copy link
Contributor Author

talevy commented Dec 12, 2019

excuse me. I forgot to mention that there is more than just the shape's type/dimension that needs to be serialized: the sum of the weight (can be negative)

talevy added a commit that referenced this issue Dec 18, 2019
This commit serializes the ShapeType of the indexed
geometry. The ShapeType can be useful for other future
features. For one thing: #49887 depends on the ability
to determine what the highest dimensional shape is
for centroid calculations.

GeometryCollection is reduced to the sub-shape of the
highest dimension

relates #37206.
talevy added a commit that referenced this issue Dec 19, 2019
This commit serializes the ShapeType of the indexed
geometry. The ShapeType can be useful for other future
features. For one thing: #49887 depends on the ability
to determine what the highest dimensional shape is
for centroid calculations.

GeometryCollection is reduced to the sub-shape of the
highest dimension

relates #37206.
talevy added a commit that referenced this issue Jan 8, 2020
This PR implements proper centroid calculations of geometries according to the 
definition defined in #49887. To compute things correctly, an additional variable 
encoded long representing the total weight for the centroid of the geometry in 
a tree. This weight is always positive. Some tests are fixed, as they did not 
have valid geometries.

closes #49887.
@talevy
Copy link
Contributor Author

talevy commented Feb 3, 2020

this was implemented in #50297. closing

@talevy talevy closed this as completed Feb 3, 2020
talevy added a commit that referenced this issue Feb 10, 2020
This PR implements proper centroid calculations of geometries according to the
definition defined in #49887. To compute things correctly, an additional variable
encoded long representing the total weight for the centroid of the geometry in
a tree. This weight is always positive. Some tests are fixed, as they did not
have valid geometries.

closes #49887.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes
Projects
None yet
Development

No branches or pull requests

3 participants