introduce EdgeTree writer and reader #40810

Merged
merged 14 commits into from
May 21, 2019

Conversation

talevy
Contributor

@talevy talevy commented Apr 3, 2019

This commit introduces a new data structure for writing EdgeTrees to a
serialized form and reading them back.

This tree is the basis of Polygon trees, which will also contain a representation
of any holes in more complex polygons. In this first pass, only polygons without holes are supported.

Using an OSM dataset of 27,105,113 valid linear rings, I indexed them
using an implementation in which a GeometryTree holds the bounding box
of the inner EdgeTree. Further optimizations, and a reduction in abstraction and redundancy when only one EdgeTree exists, are still possible.

doc mapping

            "properties": {
              "points": {
                "type": "geo_point"
              },
              "shape": {
                "type": "geo_shape"
              }
            }

example document

{"points":[[-160.50130570000002, 55.3354488], [-160.5014444, 55.3352878], [-160.5015342, 55.335312800000004], [-160.50161590000002, 55.3352179], [-160.501349, 55.3351435], [-160.5012595, 55.3352474], [-160.50132630000002, 55.335266000000004], [-160.5011954, 55.335418100000005], [-160.50130570000002, 55.3354488]],"shape":{"type":"Polygon","coordinates":[[[-160.50130570000002, 55.3354488], [-160.5014444, 55.3352878], [-160.5015342, 55.335312800000004], [-160.50161590000002, 55.3352179], [-160.501349, 55.3351435], [-160.5012595, 55.3352474], [-160.50132630000002, 55.335266000000004], [-160.5011954, 55.335418100000005], [-160.50130570000002, 55.3354488]]]}}

results from rally:

|   Lap |                          Metric |   Task |   Value |   Unit |
|------:|--------------------------------:|-------:|--------:|-------:|
|   All |              Total Young Gen GC |        |       0 |      s |
|   All |                Total Old Gen GC |        |       0 |      s |
|   All |                  Min Throughput |   bulk |  571.08 | docs/s |
|   All |               Median Throughput |   bulk | 1951.59 | docs/s |
|   All |                  Max Throughput |   bulk | 3436.21 | docs/s |
|   All |         50th percentile latency |   bulk | 76.1577 |     ms |
|   All |         90th percentile latency |   bulk | 165.303 |     ms |
|   All |         99th percentile latency |   bulk | 1780.44 |     ms |
|   All |       99.9th percentile latency |   bulk | 3409.94 |     ms |
|   All |      99.99th percentile latency |   bulk | 9411.48 |     ms |
|   All |        100th percentile latency |   bulk | 60022.5 |     ms |
|   All |    50th percentile service time |   bulk | 76.1577 |     ms |
|   All |    90th percentile service time |   bulk | 165.303 |     ms |
|   All |    99th percentile service time |   bulk | 1780.44 |     ms |
|   All |  99.9th percentile service time |   bulk | 3409.94 |     ms |
|   All | 99.99th percentile service time |   bulk | 9411.48 |     ms |
|   All |   100th percentile service time |   bulk | 60022.5 |     ms |
|   All |                      error rate |   bulk |   22.91 |      % |

disk usage report:

total disk:     24,900,881,453
num docs:           27,105,113
stored fields:   5,197,206,742
term vectors:                0
norms:                       0
docvalues:      11,765,499,000
postings:            1,727,908
prox:                        0
points:          7,651,749,909
terms:             284,696,124

        field           total      terms dict        postings       proximity          points       docvalues       % with dv                       features
        =====           =====      ==========        ========       =========       =========       =========        ========                       ========
        shape  14,117,465,871               0               0               0   4,992,899,775   9,124,566,096          100.0%               4bytes/7D binary
       points   5,183,648,448               0               0               0   2,601,226,493   2,582,421,955          100.0%       4bytes/2D sorted_numeric
          _id     284,696,099     284,695,835             110               0               0             154            0.0%                           docs
      _seq_no     114,306,603               0               0               0      55,796,578      58,510,025          100.0%              8bytes/1D numeric
 _field_names       1,728,241             289       1,727,798               0               0             154            0.0%                           docs
_primary_term             231               0               0               0               0             231          100.0%                        numeric
     _version             231               0               0               0               0             231          100.0%                        numeric
      _source             154               0               0               0               0             154            0.0%                               

@talevy talevy added WIP :Analytics/Geo Indexing, search aggregations of geo points and shapes labels Apr 3, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@talevy talevy force-pushed the tl-dv-geoshape branch 4 times, most recently from 7c18053 to 4d87465 Compare April 4, 2019 21:52
Contributor

@jpountz jpountz left a comment

This looks like a good start! It'd be nice to take advantage of some invariants to improve space-efficiency. For instance instead of recording maxY as an int, we could maybe record maxY-minY as a vint? I'm sure there are other similar things we could do.
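To make the suggestion concrete, here is a minimal sketch of the idea, not the code in this PR. The class name `ExtentEncodingSketch` and its methods are hypothetical, and the vint layout (7 payload bits per byte, continuation bit in the high bit) is an assumption chosen to resemble common variable-length integer encodings. `minY` stays a fixed 4-byte int, while `maxY - minY` is written as a vint.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: fixed-width minY followed by a variable-length (maxY - minY).
public class ExtentEncodingSketch {

    /** Writes value as an unsigned vint: 7 payload bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so an overflowed (negative) delta still terminates
        }
        out.write(value);
    }

    /** Writes value as 4 big-endian bytes. */
    static void writeInt(ByteArrayOutputStream out, int value) {
        out.write(value >>> 24);
        out.write(value >>> 16);
        out.write(value >>> 8);
        out.write(value);
    }

    /** Encodes [minY, maxY]; the delta is treated as an unsigned 32-bit value. */
    static byte[] encodeYRange(int minY, int maxY) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(out, minY);
        writeVInt(out, maxY - minY);
        return out.toByteArray();
    }

    /** Decodes the bytes produced by encodeYRange back into {minY, maxY}. */
    static int[] decodeYRange(byte[] buf) {
        int minY = ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
            | ((buf[2] & 0xFF) << 8) | (buf[3] & 0xFF);
        int delta = 0;
        int shift = 0;
        int pos = 4;
        byte b;
        do {
            b = buf[pos++];
            delta |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        // Two's-complement wrap-around restores maxY even if the delta overflowed an int.
        return new int[] { minY, minY + delta };
    }
}
```

Under these assumptions, a narrow range such as [100, 130] takes 5 bytes instead of 8, while a range spanning the full int domain takes 9. Whether the saving is worthwhile depends on how narrow the ranges are in real data.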

Contributor

@iverase iverase left a comment

I agree with @jpountz, it is a great start. Note that I have been refactoring the Lucene implementation and I think this implementation can benefit from it. If you want, have a look at this PR:

apache/lucene-solr#627

@talevy
Contributor Author

talevy commented Apr 5, 2019

thanks for the reviews @jpountz and @iverase! I will follow up with another pass

@talevy
Contributor Author

talevy commented Apr 5, 2019

@jpountz regarding:

For instance instead of recording maxY as an int, we could maybe record maxY-minY as a vint? I'm sure there are other similar things we could do.

My goal in using consistently sized integers was to be able to skip serializing tree branches when they are unnecessary. If the size of each node is variable, this becomes difficult.
Is there a way to have both? Or is one more important than the other?
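The trade-off can be illustrated with a small sketch. This is not the serialization format used in the PR; the class `FixedWidthNodesSketch`, the 12-byte node layout, and the implicit balanced tree over a sorted array are all assumptions made for illustration. Because every node has the same width, the offset of any node is plain arithmetic, so a reader can skip a whole subtree without decoding any of its bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: fixed-width nodes let a reader jump over subtrees by offset arithmetic.
public class FixedWidthNodesSketch {
    // Assumed node layout: edge minY, edge maxY, max Y of the node's whole subtree.
    static final int NODE_BYTES = 3 * Integer.BYTES;

    static int nodesRead; // demonstration counter: how many nodes the last query decoded

    /** Serializes edges ({minY, maxY} pairs) as an implicit balanced tree over the minY-sorted array. */
    static ByteBuffer write(int[][] edges) {
        int[][] sorted = edges.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] e) -> e[0]));
        ByteBuffer buf = ByteBuffer.allocate(sorted.length * NODE_BYTES);
        fill(buf, sorted, 0, sorted.length);
        return buf;
    }

    // The root of the range [lo, hi) is its midpoint; returns the subtree's max Y.
    private static int fill(ByteBuffer buf, int[][] sorted, int lo, int hi) {
        if (lo >= hi) {
            return Integer.MIN_VALUE;
        }
        int mid = (lo + hi) >>> 1;
        int left = fill(buf, sorted, lo, mid);
        int right = fill(buf, sorted, mid + 1, hi);
        int subtreeMax = Math.max(sorted[mid][1], Math.max(left, right));
        int base = mid * NODE_BYTES; // fixed width makes the offset a simple multiplication
        buf.putInt(base, sorted[mid][0]);
        buf.putInt(base + 4, sorted[mid][1]);
        buf.putInt(base + 8, subtreeMax);
        return subtreeMax;
    }

    /** Counts edges whose [minY, maxY] range contains y. */
    static int countEdgesSpanning(ByteBuffer buf, int y) {
        nodesRead = 0;
        return count(buf, y, 0, buf.capacity() / NODE_BYTES);
    }

    private static int count(ByteBuffer buf, int y, int lo, int hi) {
        if (lo >= hi) {
            return 0;
        }
        int mid = (lo + hi) >>> 1;
        int base = mid * NODE_BYTES;
        nodesRead++;
        if (y > buf.getInt(base + 8)) {
            return 0; // every edge in this subtree ends below y: skip it without reading further
        }
        int minY = buf.getInt(base);
        int hits = (minY <= y && y <= buf.getInt(base + 4)) ? 1 : 0;
        hits += count(buf, y, lo, mid);
        if (y >= minY) {
            hits += count(buf, y, mid + 1, hi); // right subtree only holds edges with minY >= this minY
        }
        return hits;
    }
}
```

With variable-length nodes, the `mid * NODE_BYTES` arithmetic no longer works, and each node would need to store an explicit offset or subtree size to support the same skipping.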

@jpountz
Contributor

jpountz commented Apr 5, 2019

This is a good question, I think the answer is going to depend on how much we can save. If it is only a couple of percent, then it's probably not worth doing. However, if simple compression can achieve significant savings, then I'm sure the question will come up again in the future and we will have to deal with backward compatibility, which is why I'd like to raise this question now. It will indeed make serialization harder.

talevy added 3 commits April 5, 2019 14:34
This commit introduces a new data-structure
for reading and writing EdgeTrees that write/read
serialized versions of the tree.

This tree is the basis of Polygon trees that will contain representation
of any holes in the more complex polygon
@talevy
Contributor Author

talevy commented Apr 10, 2019

Hi @jpountz, since the format we choose depends on further testing with
real data, I'd like to first assess whether this is a good start for the feature branch, and
then we can iterate on performance changes on the branch.

I'm working, on a separate branch, to test the performance of this simple GeometryTree that only supports polygons without holes, and stores multi-polygons in a list, rather than a KDTree. I'd like to keep things simple for now and build on the structure in steps. The reason I introduced the GeometryTree was to show how the EdgeTree would be created from the Geometry object that is being indexed.

let me know what you think!

@jpountz
Contributor

jpountz commented Apr 10, 2019

I'll defer to @nknize and @iverase. :)

@iverase
Contributor

iverase commented Apr 10, 2019

The first thing that comes to mind is that this structure will not be efficient for the Geo Bounds or Geo Centroid aggregations. I think we at least need to add the bounding box of all the components at the beginning of the GeometryTree to get a feeling for performance.

@talevy
Contributor Author

talevy commented Apr 11, 2019

@iverase thanks. I've added the GeometryTree's extent so that it can be used to filter out non-matches or trivial matches. I'm currently working on benchmarking this.
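As a rough illustration of how a stored extent can short-circuit a query, here is a minimal sketch. The class `ExtentFilterSketch`, its `Box` type, and the `Relation` names are hypothetical and are not taken from this PR; the sketch also assumes boxes that do not cross the dateline.

```java
// Hypothetical sketch: compare a query box with a shape's stored extent before touching the tree.
public class ExtentFilterSketch {

    enum Relation {
        DISJOINT,            // extent and query do not overlap: a non-match, no tree traversal needed
        EXTENT_INSIDE_QUERY, // the whole extent lies inside the query: a trivial match
        NEEDS_TREE           // partial overlap: the edges must be consulted to decide
    }

    /** Axis-aligned box in encoded integer coordinates; assumed not to cross the dateline. */
    static final class Box {
        final int minX, minY, maxX, maxY;

        Box(int minX, int minY, int maxX, int maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    static Relation relate(Box extent, Box query) {
        if (extent.maxX < query.minX || extent.minX > query.maxX
            || extent.maxY < query.minY || extent.minY > query.maxY) {
            return Relation.DISJOINT;
        }
        if (query.minX <= extent.minX && extent.maxX <= query.maxX
            && query.minY <= extent.minY && extent.maxY <= query.maxY) {
            return Relation.EXTENT_INSIDE_QUERY;
        }
        return Relation.NEEDS_TREE;
    }
}
```

Only the `NEEDS_TREE` case requires reading the serialized edges. An aggregation that only needs bounds could, under the same assumption, read the extent and stop there.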

@talevy
Contributor Author

talevy commented Apr 11, 2019

To keep this PR smaller and introduce fewer "TODO" items, I've removed the GeometryTree from this PR.

@talevy talevy removed the WIP label Apr 11, 2019
@talevy talevy marked this pull request as ready for review April 11, 2019 14:38
@talevy talevy requested review from iverase and nknize April 11, 2019 14:38
@talevy
Contributor Author

talevy commented Apr 16, 2019

Closing to re-evaluate the implementation. Further investigation is necessary before settling on the exact data structure.

@talevy talevy closed this Apr 16, 2019
@talevy talevy reopened this May 1, 2019
@talevy talevy requested review from imotov, nknize and iverase and removed request for nknize and iverase May 1, 2019 18:07
@talevy
Contributor Author

talevy commented May 1, 2019

I've re-opened this PR since we have now decided to continue forward with this implementation!

@talevy
Contributor Author

talevy commented May 7, 2019

run elasticsearch-ci/packaging-sample

Contributor

@nknize nknize left a comment

I'm +1... this is looking good. It's a feature branch, so I'm not opposed to looking into using Lucene GeoTestUtil in a future PR.


public void testRectangleShape() throws IOException {
    for (int i = 0; i < 1000; i++) {
        int minX = randomIntBetween(-180, 170);
Contributor

Can we leverage Lucene's GeoTestUtil? It has test methods that tend to create adversarial rectangles that either cross the dateline (nextBox) or not (nextBoxNotCrossingDateline).


public void testSimplePolygon() throws IOException {
    for (int iter = 0; iter < 1000; iter++) {
        ShapeBuilder builder = RandomShapeGenerator.createShape(random(), RandomShapeGenerator.ShapeType.POLYGON);
Contributor

Since we're just creating a Lucene Polygon, I think we could also use Lucene's GeoTestUtil here so we don't have to rely on the builder. GeoTestUtil.nextPolygon() will create a lot of adversarial cases for us to stress test the edge tree.

@talevy
Contributor Author

talevy commented May 21, 2019

thanks Nick! I will look into GeoTestUtil usage in a follow-up.

@talevy talevy merged this pull request into elastic:geoshape-doc-values May 21, 2019
@talevy talevy deleted the tl-dv-geoshape branch May 21, 2019 18:08
talevy added a commit that referenced this pull request May 21, 2019
This commit introduces a new data-structure
for reading and writing EdgeTrees that write/read
serialized versions of the tree.

This tree is the basis of Polygon trees that will contain representation
of any holes in the more complex polygon
talevy added a commit that referenced this pull request Sep 20, 2019
This commit introduces a new data-structure
for reading and writing EdgeTrees that write/read
serialized versions of the tree.

This tree is the basis of Polygon trees that will contain representation
of any holes in the more complex polygon
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes
5 participants