introduce EdgeTree writer and reader #40810

talevy · 2019-04-03T18:42:22Z

This commit introduces a writer and a reader for EdgeTrees:
the writer serializes the tree and the reader reads
the serialized version.

This tree is the basis of the Polygon trees, which will also contain a representation
of any holes in more complex polygons. In this first pass, only polygons without holes are supported.

Using an OSM dataset of 27,105,113 valid linear rings, I indexed them
with the implementation in which a GeometryTree holds the bounding box
of the inner EdgeTree. More optimizations, and less abstraction/redundancy when only one EdgeTree exists, are still possible.

doc mapping

            "properties": {
              "points": {
                "type": "geo_point"
              },
              "shape": {
                "type": "geo_shape"
              }
            }

example document

{"points":[[-160.50130570000002, 55.3354488], [-160.5014444, 55.3352878], [-160.5015342, 55.335312800000004], [-160.50161590000002, 55.3352179], [-160.501349, 55.3351435], [-160.5012595, 55.3352474], [-160.50132630000002, 55.335266000000004], [-160.5011954, 55.335418100000005], [-160.50130570000002, 55.3354488]],"shape":{"type":"Polygon","coordinates":[[[-160.50130570000002, 55.3354488], [-160.5014444, 55.3352878], [-160.5015342, 55.335312800000004], [-160.50161590000002, 55.3352179], [-160.501349, 55.3351435], [-160.5012595, 55.3352474], [-160.50132630000002, 55.335266000000004], [-160.5011954, 55.335418100000005], [-160.50130570000002, 55.3354488]]]}}

results from rally:

|   Lap |                          Metric |   Task |   Value |   Unit |
|------:|--------------------------------:|-------:|--------:|-------:|
|   All |              Total Young Gen GC |        |       0 |      s |
|   All |                Total Old Gen GC |        |       0 |      s |
|   All |                  Min Throughput |   bulk |  571.08 | docs/s |
|   All |               Median Throughput |   bulk | 1951.59 | docs/s |
|   All |                  Max Throughput |   bulk | 3436.21 | docs/s |
|   All |         50th percentile latency |   bulk | 76.1577 |     ms |
|   All |         90th percentile latency |   bulk | 165.303 |     ms |
|   All |         99th percentile latency |   bulk | 1780.44 |     ms |
|   All |       99.9th percentile latency |   bulk | 3409.94 |     ms |
|   All |      99.99th percentile latency |   bulk | 9411.48 |     ms |
|   All |        100th percentile latency |   bulk | 60022.5 |     ms |
|   All |    50th percentile service time |   bulk | 76.1577 |     ms |
|   All |    90th percentile service time |   bulk | 165.303 |     ms |
|   All |    99th percentile service time |   bulk | 1780.44 |     ms |
|   All |  99.9th percentile service time |   bulk | 3409.94 |     ms |
|   All | 99.99th percentile service time |   bulk | 9411.48 |     ms |
|   All |   100th percentile service time |   bulk | 60022.5 |     ms |
|   All |                      error rate |   bulk |   22.91 |      % |

disk usage report:

total disk:     24,900,881,453
num docs:           27,105,113
stored fields:   5,197,206,742
term vectors:                0
norms:                       0
docvalues:      11,765,499,000
postings:            1,727,908
prox:                        0
points:          7,651,749,909
terms:             284,696,124

        field           total      terms dict        postings       proximity          points       docvalues       % with dv                       features
        =====           =====      ==========        ========       =========       =========       =========        ========                       ========
        shape  14,117,465,871               0               0               0   4,992,899,775   9,124,566,096          100.0%               4bytes/7D binary
       points   5,183,648,448               0               0               0   2,601,226,493   2,582,421,955          100.0%       4bytes/2D sorted_numeric
          _id     284,696,099     284,695,835             110               0               0             154            0.0%                           docs
      _seq_no     114,306,603               0               0               0      55,796,578      58,510,025          100.0%              8bytes/1D numeric
 _field_names       1,728,241             289       1,727,798               0               0             154            0.0%                           docs
_primary_term             231               0               0               0               0             231          100.0%                        numeric
     _version             231               0               0               0               0             231          100.0%                        numeric
      _source             154               0               0               0               0             154            0.0%

elasticmachine · 2019-04-03T18:42:24Z

Pinging @elastic/es-analytics-geo

jpountz

This looks like a good start! It'd be nice to take advantage of some invariants to improve space-efficiency. For instance, instead of recording maxY as an int, we could maybe record maxY-minY as a vint? I'm sure there are other similar things we could do.

server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java

iverase

I agree with @jpountz, it is a great start. Note that I have been refactoring the Lucene implementation and I think this implementation can benefit from it. If you want, have a look at this PR:

apache/lucene-solr#627

server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java

talevy · 2019-04-05T15:39:03Z

Thanks for the reviews @jpountz and @iverase! I will follow up with another pass.

talevy · 2019-04-05T15:42:00Z

@jpountz regarding:

> For instance instead of recording maxY as an int, we could maybe record maxY-minY as a vint? I'm sure there are other similar things we could do.

My goal in using consistent sizing of the integers was to have the ability to skip serializing of tree branches if they were unnecessary. If the size of each node is variable, this becomes difficult.
Is there a way to have both? Or is one more important than the other?
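To make the fixed-width argument concrete, here is a minimal sketch of how constant node sizes let a reader jump over a subtree with plain arithmetic. The class name, the 12-byte node layout, and the pre-order arrangement are all assumptions for illustration, not the format used in this PR.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width layout; NOT the serialization format of this PR. */
final class FixedWidthTreeSketch {
    // Assumed node layout: minY, maxY, subtreeMaxY, each a 4-byte int.
    static final int NODE_BYTES = 3 * Integer.BYTES;

    /** Serializes edges ({minY, maxY}, sorted by minY) as a balanced tree in pre-order. */
    static ByteBuffer build(int[][] edges) {
        ByteBuffer out = ByteBuffer.allocate(edges.length * NODE_BYTES);
        write(out, edges, 0, edges.length);
        return out;
    }

    /** Writes the subtree for edges[lo, hi) and returns the largest maxY in it. */
    static int write(ByteBuffer out, int[][] edges, int lo, int hi) {
        if (lo >= hi) {
            return Integer.MIN_VALUE;
        }
        int mid = lo + (hi - lo) / 2;
        int pos = out.position();
        // subtreeMaxY is not known yet, so write a placeholder and patch it afterwards.
        out.putInt(edges[mid][0]).putInt(edges[mid][1]).putInt(0);
        int leftMax = write(out, edges, lo, mid);
        int rightMax = write(out, edges, mid + 1, hi);
        int max = Math.max(edges[mid][1], Math.max(leftMax, rightMax));
        out.putInt(pos + 2 * Integer.BYTES, max);
        return max;
    }

    /** Counts edges in the subtree [lo, hi) stored at pos whose y-range overlaps [qMin, qMax]. */
    static int countOverlaps(ByteBuffer in, int pos, int lo, int hi, int qMin, int qMax) {
        if (lo >= hi) {
            return 0;
        }
        int minY = in.getInt(pos);
        int maxY = in.getInt(pos + Integer.BYTES);
        int subtreeMaxY = in.getInt(pos + 2 * Integer.BYTES);
        if (subtreeMaxY < qMin) {
            return 0; // the whole subtree lies below the query, so none of it is read
        }
        int mid = lo + (hi - lo) / 2;
        int leftPos = pos + NODE_BYTES;
        // Fixed-width nodes: the right child starts (mid - lo) nodes after the left child.
        int rightPos = leftPos + (mid - lo) * NODE_BYTES;
        int count = countOverlaps(in, leftPos, lo, mid, qMin, qMax);
        if (minY <= qMax) { // otherwise this node and everything to its right start above the query
            if (maxY >= qMin) {
                count++;
            }
            count += countOverlaps(in, rightPos, mid + 1, hi, qMin, qMax);
        }
        return count;
    }
}
```

With variable-length nodes, `rightPos` could no longer be computed from the node count alone; the writer would have to store an explicit byte offset or subtree size per node.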

jpountz · 2019-04-05T16:43:59Z

This is a good question. I think the answer is going to depend on how much we can save. If it is only a couple of percent, then it's probably not worth doing. However, if simple compression can achieve significant savings, then I'm sure the question will come up again in the future and we will have to deal with backward compatibility, which is why I'd like to raise it now. It will indeed make serialization harder.
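As a rough way to gauge the potential savings, the sketch below encodes minY as a fixed 4-byte int and maxY-minY as a variable-length int, using the usual 7-bits-per-byte scheme with a continuation bit. The class and method names are hypothetical and the encoder is hand-rolled for illustration; it is not code from this PR.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of delta + vint encoding; NOT the format used in this PR. */
final class VIntDeltaSketch {

    /** Writes value using 7 bits per byte; the high bit marks that more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes minY as a fixed 4-byte big-endian int followed by vint(maxY - minY). */
    static byte[] encodeRange(int minY, int maxY) {
        if (maxY < minY) {
            throw new IllegalArgumentException("invariant violated: maxY < minY");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(minY >>> 24);
        out.write(minY >>> 16);
        out.write(minY >>> 8);
        out.write(minY);
        // The delta is treated as unsigned, so it takes at most 5 bytes even if it overflows int.
        writeVInt(out, maxY - minY);
        return out.toByteArray();
    }

    /** Reads minY, then the vint delta, and reconstructs maxY. */
    static int decodeMaxY(byte[] bytes) {
        int minY = ((bytes[0] & 0xFF) << 24) | ((bytes[1] & 0xFF) << 16)
            | ((bytes[2] & 0xFF) << 8) | (bytes[3] & 0xFF);
        int pos = 4;
        int shift = 0;
        int delta = 0;
        byte b;
        do {
            b = bytes[pos++];
            delta |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return minY + delta;
    }
}
```

Two fixed ints always cost 8 bytes; here a range whose height is under 128 encoded units costs 5. Whether real edges are typically that short is exactly the thing that would need measuring on real data.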


talevy · 2019-04-10T00:50:30Z

Hi @jpountz, since the format we choose depends on further testing with
real data, I'd like to first assess whether this is a good start for the feature branch;
then we can iterate on performance changes on the branch.

I'm working, on a separate branch, to test the performance of this simple GeometryTree that only supports polygons without holes, and stores multi-polygons in a list, rather than a KDTree. I'd like to keep things simple for now and build on the structure in steps. The reason I introduced the GeometryTree was to show how the EdgeTree would be created from the Geometry object that is being indexed.

let me know what you think!

jpountz · 2019-04-10T07:22:56Z

I'll defer to @nknize and @iverase. :)

iverase · 2019-04-10T08:47:00Z

The first thing that comes to mind is that this structure will not be efficient for the Geo Bounds or Geo Centroid aggregations. I think we at least need to add the bounding box of all the components at the beginning of the GeometryTree to get a feeling for performance.

talevy · 2019-04-11T04:42:14Z

@iverase thanks. I've added the GeometryTree's extent so that it can be used to filter out non-matches or trivial matches. I'm currently working on benchmarking this.
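The extent check described above could look roughly like the following sketch: compare the stored bounding box against the query box first, and only walk the tree when the answer is not already decided. The class name, the enum, and the `{minX, minY, maxX, maxY}` array layout are assumptions for illustration, not the PR's API.

```java
/** Hypothetical sketch of an extent pre-filter; names and layout are NOT from this PR. */
final class ExtentFilterSketch {
    enum Relation {
        DISJOINT,   // extent and query do not touch: a non-match, no tree walk needed
        CONTAINED,  // query covers the whole extent: a trivial match, no tree walk needed
        NEEDS_TREE  // boxes partially overlap: the edges have to be inspected
    }

    /** Both arguments are boxes laid out as {minX, minY, maxX, maxY}. */
    static Relation relate(int[] extent, int[] query) {
        boolean disjoint = extent[2] < query[0] || extent[0] > query[2]
            || extent[3] < query[1] || extent[1] > query[3];
        if (disjoint) {
            return Relation.DISJOINT;
        }
        boolean contained = query[0] <= extent[0] && query[1] <= extent[1]
            && query[2] >= extent[2] && query[3] >= extent[3];
        return contained ? Relation.CONTAINED : Relation.NEEDS_TREE;
    }
}
```

Only the `NEEDS_TREE` case has to read past the extent, which is also why keeping the extent at the start of the serialized form helps bounds-style aggregations.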


talevy · 2019-04-11T14:37:50Z

To keep this PR smaller and introduce fewer "TODO" items, I've removed the GeometryTree from this PR.

talevy · 2019-04-16T23:18:13Z

Closing to re-evaluate the implementation. Further investigation is necessary before settling on the exact data structure.


talevy · 2019-05-01T18:08:23Z

I've re-opened this PR since we have now decided to continue forward with this implementation!


talevy · 2019-05-07T22:42:29Z

run elasticsearch-ci/packaging-sample

nknize

I'm +1... this is looking good. It's a feature branch, so I'm not opposed to looking into using Lucene's GeoTestUtil in a future PR.

nknize · 2019-05-21T15:24:05Z

server/src/test/java/org/elasticsearch/common/geo/EdgeTreeTests.java

+
+    public void testRectangleShape() throws IOException {
+        for (int i = 0; i < 1000; i++) {
+            int minX = randomIntBetween(-180, 170);


Can we leverage Lucene's GeoTestUtil? It has test methods that tend to create adversarial rectangles that either cross the dateline (nextBox) or do not (nextBoxNotCrossingDateline).

nknize · 2019-05-21T15:25:46Z

server/src/test/java/org/elasticsearch/common/geo/EdgeTreeTests.java

+
+    public void testSimplePolygon() throws IOException  {
+        for (int iter = 0; iter < 1000; iter++) {
+            ShapeBuilder builder = RandomShapeGenerator.createShape(random(), RandomShapeGenerator.ShapeType.POLYGON);


Since we're just creating Lucene Polygon, I think we could also use Lucene's GeoTestUtil here so we don't have to rely on the builder. GeoTestUtil.nextPolygon() will create a lot of adversarial cases for us to stress test the edge tree.

talevy · 2019-05-21T18:08:15Z

Thanks Nick! I will look into GeoTestUtil usage in a follow-up.


talevy added the WIP and :Analytics/Geo (Indexing, search aggregations of geo points and shapes) labels Apr 3, 2019

talevy force-pushed the tl-dv-geoshape branch 4 times, most recently from 7c18053 to 4d87465, April 4, 2019 21:52

jpountz reviewed Apr 5, 2019

server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java (outdated)

server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java (outdated)

iverase reviewed Apr 5, 2019

server/src/main/java/org/elasticsearch/common/geo/LinearRingEdgeTreeReader.java (outdated)

talevy added 3 commits April 5, 2019 14:34

introduce EdgeTree writer and reader

9eabd12

This commit introduces a new data-structure for reading and writing EdgeTrees that write/read serialized versions of the tree. This tree is the basis of Polygon trees that will contain representation of any holes in the more complex polygon

next step

bcc7b68

iterate and add geometry tree

8135bf1

talevy force-pushed the tl-dv-geoshape branch from 4d87465 to 8135bf1, April 9, 2019 23:38

Merge remote-tracking branch 'upstream/master' into tl-dv-geoshape

2cddf8a

check style

ab2fade

add extent to geometry-tree

b3df270

talevy added 2 commits April 11, 2019 07:35

remove geometry-tree and add some comments

9513b0a

Merge remote-tracking branch 'upstream/geoshape-doc-values' into tl-d…

6e38f5a

…v-geoshape

talevy removed the WIP label Apr 11, 2019

talevy marked this pull request as ready for review April 11, 2019 14:38

talevy requested review from iverase and nknize April 11, 2019 14:38

imotov mentioned this pull request Apr 11, 2019

Doc values support for geo shapes. #37206

Closed

revert mapper change

e615a1c

talevy closed this Apr 16, 2019

talevy reopened this May 1, 2019

Merge remote-tracking branch 'upstream/geoshape-doc-values' into tl-d…

cf57f71

…v-geoshape

talevy requested review from imotov, nknize and iverase and removed request for nknize and iverase May 1, 2019 18:07

talevy added 4 commits May 1, 2019 14:30

Merge remote-tracking branch 'upstream/geoshape-doc-values' into tl-d…

088bb1d

…v-geoshape

Merge remote-tracking branch 'upstream/geoshape-doc-values' into tl-d…

9a839bf

…v-geoshape

Merge remote-tracking branch 'upstream/geoshape-doc-values' into tl-d…

41e9135

…v-geoshape

update with latest lucene

9959c09

nknize approved these changes May 21, 2019


talevy merged this pull request into elastic:geoshape-doc-values May 21, 2019

talevy deleted the tl-dv-geoshape branch May 21, 2019 18:08
