Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build triangle tree from geo_shape indexed fields #50012

Merged
merged 3 commits into from
Dec 12, 2019

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Dec 10, 2019

The current version of the triangle tree copies quite a lot of Lucene code for generating the triangles. This PR uses the already generated Indexed fields and use the method ShapeField#decodeTriangle to generate those triangles reducing the complexity of the TriangleTreeWriter.

In addition this Pr adds:

  • It uses Lucene's ByteBuffersDataOutput and ByteArrayDataInput to serialise / deserialise the tree. This allows better reusability of the read object.
  • BoundingBox object is now reusable as well.

@iverase iverase added the :Analytics/Geo Indexing, search aggregations of geo points and shapes label Dec 10, 2019
@iverase iverase requested review from talevy and imotov December 10, 2019 08:54
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

@iverase
Copy link
Contributor Author

iverase commented Dec 10, 2019

Ups, I think the extent logic is wrong. In this case the extent is built using the bounding box of the triangles and it seems the result is not the expected one. I am starting having doubts if this logic really design for points can work for shapes.

@iverase
Copy link
Contributor Author

iverase commented Dec 10, 2019

Ok I think I fix the extent logic. Whenever you have a shape that crosses the dateline, we flag that the extent cannot we wrapped around the dateline by setting the negRight and posLeft values to 0.

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in general. I left a couple of structural suggestions and some questions.

if (fieldType().hasDocValues()) {
geometryIndexer.indexDocValueField(context, shape);
// doc values are generated from the indexed fields.
ShapeField.DecodedTriangle[] triangles = new ShapeField.DecodedTriangle[fields.size()];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would push this inside GeometryIndexer's indexShape() it will encapsulate all decomposition/tessellation/convertion to lucene logic. It should allow us to write some unit tests for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support index = false in geo_shape? It seems that is ignored and the shape will always be indexed regardless of that setting. My suggestion would be to change the indexShape method to something like:

List<IndexableField> indexShape(ParseContext context, Processed shape, boolean index, boolean docValues);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And again is a bit more difficult. because we need to support arrays of shapes and we can only have one geo shape doc value per document.

@@ -32,12 +45,13 @@
private double sumY;
private int count;

public CentroidCalculator() {
public CentroidCalculator(Geometry geometry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class feels a bit brittle since it is used for calculation and then passed as the final result. I wonder if we could pass just X, Y after we are done with calculations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cannot do that because we need to support arrays of shapes and therefore the calculation of the centroid needs the calculator. This is something it should change in #49887 so I think we should not worry too much. Depending in how we decide to calculate the centroid we would need to change this class accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that if we are to change how the calculation is structured, it is best to do that in the dedicated calculator issue Ignacio mentioned

@@ -95,13 +104,14 @@ public GeoPoint geoPoint() {

@Override
public BoundingBox boundingBox() {
return new BoundingBox(geoPoint);
boundingBox.reset(geoPoint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment why this doesn't happen in the constructor?

public final double posLeft;
public final double posRight;

public BoundingBox(Extent extent, CoordinateEncoder coordinateEncoder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much this really buys us performance-wise in modern JVM? Did you run some benchmarks by any chance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some performance testing in a throw-away-branch (independent of this) where I limited the creation of Extents and such. I didn't see a significant performance difference, just a difference in object allocations which may help reduce garbage collection time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all short lived objects. So, my gut feel it shouldn't be bad

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. it was less significant for when there was one BoundingBox per document/field-value.

The more important place where there is significant churn is in the geo-grid tiler that creates a new Rectangle for every tile queried. There is where I did not see a performance difference, just an object allocation difference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughput of the query should not be much different but latency can suffer if we create millions of this objects as gc will play a role. This code runs on a tight loop and you can have several of this loops running in parallel so creating objects per document is a bad idea.

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I do agree that the centroid-calculator feels awkward, but it is going to be required to change a little when #49887 is worked on

@iverase iverase merged commit 6f5919a into elastic:geoshape-doc-values Dec 12, 2019
@iverase iverase deleted the fromIndexedFields branch December 12, 2019 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants