-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce EdgeTree writer and reader #40810
Conversation
Pinging @elastic/es-analytics-geo |
7c18053
to
4d87465
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start! It'b be nice to take advantage of some invariants to improve space-efficiency. For instance instead of recording maxY
as an int, we could maybe record maxY-minY
as a vint? I'm sure there are other similar things we could do.
server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jpountz it is a great start. Note that I have been refactoring the Lucene implementation and I think this implementation can benefit from it. If you want, have look at this PR:
server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java
Outdated
Show resolved
Hide resolved
@jpountz regarding:
My goal in using consistent sizing of the integers was to have the ability to skip serializing of tree branches if they were unnecessary. If the sizes of each node are variable, it makes this difficult. |
This is a good question, I think the answer is going to depend on how much we can save. If only a couple percents, then it's probably not worth doing. However if simple compression can achieve significant savings then I'm sure that the question will come in the future again and we will have to deal with backward compatibility, which is why I'd like to raise this question now. It will indeed make serialization harder. |
This commit introduces a new data-structure for reading and writing EdgeTrees that write/read serialized versions of the tree. This tree is the basis of Polygon trees that will contain representation of any holes in the more complex polygon
Hi @jpountz, Since which format we take is dependent on further testing with I'm working, on a separate branch, to test the performance of this simple GeometryTree that only supports polygons without holes, and stores multi-polygons in a list, rather than a KDTree. I'd like to keep things simple for now and build on the structure in steps. The reason I introduced the GeometryTree was to show how the EdgeTree would be created from the Geometry object that is being indexed. let me know what you think! |
The first thing it comes to mind is that this structure will not be efficient for |
@iverase thanks. I've added the GeometryTree's extent so that it can be used to filter out non-matches or trivial matches. I'm currently working on benchmarking for this |
to keep this PR smaller, and introduce fewer "TODO" items, I've removed the GeometryTree from this PR. |
Closing to re-evaluate implementation. further investigations necessary before settling on exact data structure |
I've re-opened this PR since we have now decided to continue forward with this implementation! |
run elasticsearch-ci/packaging-sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1... this is looking good. It's a feature branch so I'm not opposed looking into using Lucene GeoTestUtil
in a future PR.
|
||
public void testRectangleShape() throws IOException { | ||
for (int i = 0; i < 1000; i++) { | ||
int minX = randomIntBetween(-180, 170); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leverage lucene's GeoTestUtil
? There are test methods that tend to create adversarial rectangles that either cross dateline (nextBox
) or not (nextBoxNotCrossingDateline
)?
|
||
public void testSimplePolygon() throws IOException { | ||
for (int iter = 0; iter < 1000; iter++) { | ||
ShapeBuilder builder = RandomShapeGenerator.createShape(random(), RandomShapeGenerator.ShapeType.POLYGON); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're just creating Lucene Polygon
, I think we could also use Lucene's GeoTestUtil
here so we don't have to rely on the builder. GeoTestUtil.nextPolygon()
will create a lot of adversarial cases for us to stress test the edge tree.
thanks Nick! I will look into GeoTestUtil usage in a follow-up |
This commit introduces a new data-structure for reading and writing EdgeTrees that write/read serialized versions of the tree. This tree is the basis of Polygon trees that will contain representation of any holes in the more complex polygon
This commit introduces a new data-structure for reading and writing EdgeTrees that write/read serialized versions of the tree. This tree is the basis of Polygon trees that will contain representation of any holes in the more complex polygon
This commit introduces a new data-structure
for reading and writing EdgeTrees that write/read
serialized versions of the tree.
This tree is the basis of Polygon trees that will contain representation
of any holes in the more complex polygon. In this first pass, only polygons without holes are supported
Using an OSM dataset of 27,105,113 valid linear rings, I went ahead
and indexed them using the implementation that has a GeometryTree with a bounding box
of the inner EdgeTree. more optimizations and reduction in abstraction/redundancy when only one EdgeTree exists can be done.
doc mapping
example document
results from rally:
disk usage report: