indexing #173

morganherlocker · 2014-12-30T03:49:24Z

So far I have rejected anything related to indexing. I think of it as a problem for the data layer, rather than the processing layer. That said, I think it should be considered.

pros

we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).
it is useful functionality, and many users will be unable to figure out how to use index modules in conjunction with turf effectively on their own
several indexes (s2 cells, tile quadkeys, and geohashes) are data independent, so they scale across clusters (r-tree, for example, does not)

cons

I am not aware of an existing convention for indexes in GeoJSON, besides the id field, which is flimsy for this
multiple types of indexes on the same dataset might clash
some common indexing schemes do not scale well in clusters (r-tree)
the fact that users have a hard time understanding indexes might just mean this allows them to do dumb things faster 😄
scalable indexes tend to rely on "universal axioms" about planet earth; turf should work work gracefully without these contextual assumptions (turf is not currently perfect on this point: units of measurement, WGS84 encoding, assumptions about precision with 32bit floats)

candidates for submodules

tile-cover: fast and scales across clusters
rbush: fast but each box needs the index in memory (this might not matter, depending on some architecture decisions we make)
geohash: fast and scalable, but does not work on lines or polygons (although maybe it would work fine with geohashed bbox sw & ne points)
tilebelt: minimum bbox quadkey would be fast and scalable (same issues as geohash)

I am 100% on the fence with this. Input greatly appreciated from anyone who understands how indexing works at a low level and people who just need to use them). This has huge architecture implications, so we really need both sides.

tmcw · 2014-12-30T03:55:01Z

I've tinkered with this with the revisions to turf-count and making turf-extent wayfaster - basically for a bunch of algorithms where we repeatedly do something expensive like point-in-polygon, doing point-in-extent first gives a serious boost. But the algorithms that benefit from this, or any other indexing strategy, seem somewhat limited for now.

we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).

Only for repeated checking of the same polygon - the initial processing of the index would be more expensive than the current pip algorithm.

From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

morganherlocker · 2014-12-30T05:00:21Z

@tmcw automatic indexes on the fly are probably bad (without some finicky heuristics as to when they should happen). What if indexing required a manual call (something like var myPoly = index(myPoly, 'quadkey')), that left an index embedded in the data. This would be checked for in turf subs that could use it.

From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

Agree with this 100% from an advanced user's perspective; people who can chop processing into reasonable chunks using indexes. I am thinking more about someone trying to find which of a million points are within a polygon. Right now, a beginner is going to be using turf like PostGIS minus indexes (unusably slow in common real world situations).

For example, what if turf-inside looked for a tile index? If its there, then it does a trivial in-out quadkey check. If not, then it goes ahead with the full algorithm. This might be close to best-of-both-worlds.

tmcw · 2014-12-30T16:02:52Z

I am thinking more about someone trying to find which of a million points are within a polygon.

I think with the small optimizations I've made for this will give a pretty good bump - I'll try this out later today.

tmcw · 2015-01-23T17:47:17Z

see also: https://github.com/mikolalysenko/box-intersect-benchmark

jfirebaugh · 2015-02-19T23:35:08Z

If there are big unknowns or tradeoffs in the choice of index datastructure, would it be possible to define algorithms like turf-inside in terms of an index protocol rather than a specific concrete implementation? Turf would then provide implementations of the protocol for rbush, tile-cover, tilebelt, etc, and users would supply their choice when instantiating a particular algorithm.

stepankuzmin · 2015-09-21T13:14:12Z

Hi all! +1 for protocols as @jfirebaugh mentioned (like rbush implementation with https://github.com/jvrousseau/turf-index).

DenisCarriere · 2017-03-07T20:43:45Z

An old issue, however very useful topic!

So far I've been using RBush for everything index related and it is VERY efficient on extremely large GeoJSON datasets.

I had to create my own "Ad Hoc" GeoJSON implementation in @turf/line-intersect, this example is a bit TOO complex for what we would need.

It would be useful to have a basic GeoJSON support implementation of RBush called @turf/index.

We would need to add some extra default properties while loading the RBush Tree to be able to retrieve the GeoJSON again (index & geojson) and yes this would destroy your memory if you are handling large Objects... (open for suggestions).

tree.insert({
    minX: minX,
    minY: minY,
    maxX: maxX,
    maxY: maxY,
    index: index,
    geojson: geojson
})

It would @returns {RBush Tree}, that way TurfJS only needs to implement loading GeometryCollection, FeatureCollection & Feature.

DenisCarriere · 2017-04-08T17:32:01Z

Since there are many different ways to implement an index (r-tree, geohashes, etc...). It's best to leave these types of index operations/implementations outside of TurfJS.

For example, the r-tree index implementation rbush or geojson-rbush can be used.

Without going too much outside the TurfJS scope, we could include the bbox attribute to the output results for a few modules (ex: @turf/line-segment) without sacrificing too much performance loss.

The benefit of having this bbox attribute already existing to each feature would reduce significantly the number of @turf/bbox calculations done when building an geohash/ r-tree index (in the case that the geometry changes, then the bbox attribute would have to be re-calculated).
http://geojson.org/geojson-spec.html#bounding-boxes

GeoJSON BBox attribute example

  { "type": "Feature",
    "properties": {},
    "bbox": [-10.0, -10.0, 10.0, 10.0],
    "geometry": {
      "type": "Polygon",
      "coordinates": [[
        [-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]
        ]]
      }
    ...
    }

@tmcw @morganherlocker Fair to say we can close this issue?

morganherlocker added question new-feature labels Dec 30, 2014

morganherlocker mentioned this issue Dec 30, 2014

feature request: geohash #168

Closed

jfirebaugh mentioned this issue Feb 19, 2015

use turf for hit testing in feature_tree.js mapbox/mapbox-gl-js#1011

Closed

jvrousseau mentioned this issue Mar 18, 2015

knn turf-junkyard/turf-nearest#4

Closed

tmcw mentioned this issue Apr 24, 2016

optional k parameter turf-junkyard/turf-nearest#1

Closed

DenisCarriere added the new-module label Mar 7, 2017

DenisCarriere mentioned this issue Mar 10, 2017

New module @turf/rbush #609

Closed

9 tasks

DenisCarriere added wontfix and removed new-feature new-module labels Apr 8, 2017

morganherlocker closed this as completed Apr 10, 2017

DenisCarriere mentioned this issue Apr 17, 2017

@turf/inside performance increase #675

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

indexing #173

indexing #173

morganherlocker commented Dec 30, 2014

tmcw commented Dec 30, 2014

morganherlocker commented Dec 30, 2014

tmcw commented Dec 30, 2014

tmcw commented Jan 23, 2015

jfirebaugh commented Feb 19, 2015

stepankuzmin commented Sep 21, 2015

DenisCarriere commented Mar 7, 2017

DenisCarriere commented Apr 8, 2017

indexing #173

indexing #173

Comments

morganherlocker commented Dec 30, 2014

pros

cons

candidates for submodules

tmcw commented Dec 30, 2014

morganherlocker commented Dec 30, 2014

tmcw commented Dec 30, 2014

tmcw commented Jan 23, 2015

jfirebaugh commented Feb 19, 2015

stepankuzmin commented Sep 21, 2015

DenisCarriere commented Mar 7, 2017

DenisCarriere commented Apr 8, 2017